Abstract

Online learning aims to perform nearly as well as the best hypothesis in hindsight. For some hypothesis classes, 
though, even finding the best hypothesis offline is challenging. In such offline cases, local search techniques are 
often employed and only local optimality guaranteed. For online decision-making with such hypothesis classes, we 
introduce local regret, a generalization of regret that aims to perform nearly as well as only nearby hypotheses. We 
then present a general algorithm to minimize local regret with arbitrary locality graphs. We also show how the graph 
structure can be exploited to drastically speed learning. These algorithms are then demonstrated on a diverse set of 
online problems: online disjunct learning, online Max-SAT, and online decision tree learning. 

1 Introduction 

An online learning task involves repeatedly taking actions and, after an action is chosen, observing the result of that 
action. This is in contrast to offline learning where the decisions are made based on a fixed batch of training data. 
As a consequence offline learning typically requires i.i.d. assumptions about how the results of actions are generated 
(on the training data, and all future data). In online learning, no such assumptions are required. Instead, the metric of 
performance used is regret: the amount of additional utility that could have been gained if some alternative sequence 
of actions had been chosen. The set of alternative sequences that are considered defines the notion of regret. Regret is
more than just a measure of performance, though; it also guides algorithms. For specific notions of regret, no-regret
algorithms exist, for which the total regret grows at worst sublinearly with time, and hence their average regret goes
to zero. These guarantees can be made with no i.i.d. or equivalent assumption on the results of the actions.

One traditional drawback of regret concepts is that the number of alternatives considered must be finite. This is 
typically achieved by assuming the number of available actions is finite, and for practical purposes, small. In offline 
learning this is not at all the case: offline hypothesis classes are usually very large, if not infinite. There have been 
attempts to achieve regret guarantees for infinite action spaces, but these have all required assumptions to be made 
on the action outcomes (e.g., convexity or smoothness). In this work, we propose new notions of regret, specifically 
for very large or infinite action sets, while avoiding any significant assumptions on the sequence of action outcomes. 
Instead, the action set is assumed to come equipped with a notion of locality, and regret is redefined to respect this 
notion of locality. This approach allows the online paradigm with its style of regret guarantees to be applied to 
previously intractable tasks and hypothesis classes. 

2 Background 

For $t \in \{1, 2, \ldots\}$, let $a^t \in A$ be the action at time $t$, and $u^t : A \to \mathbb{R}$ be the utility function over actions at time $t$.
Requirement 1. For all $t$, $\max_{a,b \in A} |u^t(a) - u^t(b)| \le \Delta$.

The basic building block of regret is the additional utility that could have been gained if some action $b$ was chosen
in place of action $a$: $R^T_{a,b} = \sum_{t=1}^{T} \mathbf{1}(a^t = a)\,(u^t(b) - u^t(a))$, where $\mathbf{1}(\text{condition})$ is equal to 1 when condition is true
and 0 otherwise. We can use this building block to define the traditional notions of regret.



$$R^T_{\text{internal}} = \max_{a,b \in A} R^{T,+}_{a,b} \qquad R^T_{\text{swap}} = \sum_{a \in A} \max_{b \in A} R^{T,+}_{a,b} \tag{1}$$

$$R^T_{\text{external}} = \max_{b \in A} \Big( \sum_{a \in A} R^T_{a,b} \Big)^{+} \tag{2}$$

where $x^+ = \max(x, 0)$ so that $R^{T,+}_{a,b} = \max(R^T_{a,b}, 0)$. Internal regret [Hart and Mas-Colell, 2002] is the maximum
utility that could be gained if one action had been chosen in place of some other action. Swap regret [Greenwald
and Jafari, 2003] is the maximum utility gained if each action could be replaced by another. External regret [Hannan,
1957], which is the original pioneering concept of regret, is the maximum utility gained by replacing all actions with
one particular action. This is the most relaxed of the three concepts, and while the others must concern themselves
with $|A|^2$ possible regret values (for all pairs of actions), external regret need only worry about $|A|$ regret values. So
although the guarantee is weaker, it is a simpler concept to learn, which can make it considerably more attractive.
These three regret notions have the following relationships.

$$R^T_{\text{internal}} \le R^T_{\text{swap}} \le |A|\, R^T_{\text{internal}} \qquad R^T_{\text{external}} \le R^T_{\text{swap}} \tag{3}$$
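
As an illustration of these definitions, the following minimal sketch computes the three global regrets for a short play history over a finite action set. The function names and the dictionary-based representation of each utility function are assumptions made for this example, not notation from the paper.

```python
from collections import defaultdict


def pairwise_regret(actions, utilities):
    """R[a][b] = sum over rounds t with a^t == a of u^t(b) - u^t(a).

    `actions` is the list of played actions; `utilities` is a list of
    dicts mapping every action to its utility in that round (assumed).
    """
    R = defaultdict(lambda: defaultdict(float))
    for a_t, u_t in zip(actions, utilities):
        for b, u_b in u_t.items():
            R[a_t][b] += u_b - u_t[a_t]
    return R


def global_regrets(actions, utilities, action_set):
    """Return (internal, swap, external) regret as in Equations (1)-(2)."""
    R = pairwise_regret(actions, utilities)
    pos = lambda x: max(x, 0.0)
    internal = max(pos(R[a][b]) for a in action_set for b in action_set)
    swap = sum(max(pos(R[a][b]) for b in action_set) for a in action_set)
    external = max(pos(sum(R[a][b] for a in action_set)) for b in action_set)
    return internal, swap, external


history = ["x", "y", "x"]
payoffs = [{"x": 0, "y": 1}, {"x": 1, "y": 0}, {"x": 0, "y": 1}]
internal, swap, external = global_regrets(history, payoffs, ["x", "y"])
```

On this toy history the values satisfy the relationships in Equation (3): internal regret is at most swap regret, which is at most $|A|$ times internal regret, and external regret is at most swap regret.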



2.1 Infinite Action Spaces 

This paper considers situations where A is infinite. To keep the notation simple, we will use max operations over 
actions to mean suprema operations and summations over actions to mean the suprema of the sum over all finite 
subsets of actions. Since we will be focused on regret over a finite time period, there will only ever be a finite set of 
actually selected actions and, hence, only a finite number of non-zero regrets, $R^T_{a,b}$. The summations over actions will
always be thought to be restricted to this finite set. 

None of the three traditional regret concepts are well-suited to $A$ being infinite. Not only does $|A|$ appear in
the regret bounds, but one can demonstrate that it is impossible to have no regret in some infinite cases. Consider
$A = \mathbb{N}$ and let $u^t$ be a step function, so $u^t(a) = 1$ if $a > y^t$ for some $y^t$ and 0 otherwise. Imagine $y^t$ is selected
so that $\Pr[a^t > y^t \mid u^{1,\ldots,t-1}, a^{1,\ldots,t-1}] \le 0.001$, which is always possible. Essentially, high utility is always just
beyond the largest action selected. Now, consider $y^* = 1 + \max_{t \le T} y^t$. In expectation $\frac{1}{T}\sum_{t=1}^{T} u^t(a^t) \le 0.001$ while
$\frac{1}{T}\sum_{t=1}^{T} u^t(y^*) = 1$ (i.e., there is large internal and external regret for not having played $y^*$), so the average regret
cannot approach zero.

Most attempts to handle infinite action spaces have proceeded by making assumptions on both $A$ and $u$. For
example, if $A$ is a compact, convex subset of $\mathbb{R}^n$ and the utilities are convex with bounded gradient on $A$, then one
can minimize regret even though $A$ is infinite [Zinkevich, 2003]. We take an alternative approach where we make use
of a notion of locality on the set $A$, and modify regret concepts to respect this locality. Different notions of locality
then result in different notions of regret. Although this typically results in a weaker form of regret for finite sets, it
breaks all dependence of regret on the size of $A$ and allows it to be applied even when $A$ is infinite and $u$ is an arbitrary
(although still bounded) function. Wide range regret methods [Lehrer, 2003] can also bound regret with respect to a set
of (countably) infinite "alternatives", but unlike our results, their asymptotic bound does not apply uniformly across
the set, and uniform finite-time bounds depend upon a finite action space [Blum and Mansour, 2007].



3 Local Regret Concepts 

Let $G = (V, E)$ be a directed graph on the set of actions, i.e., $V = A$. We do not assume $A$ is finite, but we do
assume $G$ has bounded out-degree $D = \max_{a \in V} |\{b : (a, b) \in E\}|$. This graph can be viewed as defining a notion of
locality. The semantics of an edge from a to b is that one should consider possibly taking action b in place of action 
a. Or rather, if there is no edge from a to b then one need not have any regret for not having taken action b when a 
was taken. By limiting regret only to the edges in this graph, we get the notion of local regret. Just as with traditional 
regret, which we will now refer to as global regret, we can define different variants of regret. 

$$R^T_{\text{localinternal}} = \max_{(a,b) \in E} R^{T,+}_{a,b} \qquad R^T_{\text{localswap}} = \sum_{a \in A} \max_{b : (a,b) \in E} R^{T,+}_{a,b} \tag{4}$$

Local internal and local swap regret just involve limiting regret to edges in G. Local external regret is more subtle 
and requires a notion of edge lengths. For all edges $(i,j) \in E$, let $c(i,j) > 0$ be the edge's positive length. Define
$d(a, b)$ to be the sum of the edge lengths on a shortest path from vertex $a$ to vertex $b$, and $E_b = \{(i,j) \in E : d(i,b) = c(i,j) + d(j,b)\}$
to be the set of edges that are on any shortest path to vertex $b$.

$$R^T_{\text{localexternal}} = \max_{b \in A} \Big( \sum_{(i,j) \in E_b} R^T_{i,j} / D \Big)^{+} \tag{5}$$

Global external regret considers changing all actions to some target action, regardless of locality or distance between 
the actions. In local external regret, only adjacent actions are considered, and so actions are only replaced with actions 
that take one step toward the target action. The factor of 1/D scales the regret of any one action by the out-degree, 
which is the maximum number of actions that could be one-step along a shortest path. This keeps local external regret 
on the same scale as local swap regret. 
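
The three local regret notions can be computed directly on a small finite graph. The sketch below is illustrative only: the function names, the dictionary `length` mapping directed edges to lengths, and the dictionary `R` mapping edges to cumulative regrets are assumptions of this example. Shortest-path distances are obtained with Floyd-Warshall, and an edge $(i,j)$ is placed in $E_b$ when $d(i,b) = c(i,j) + d(j,b)$.

```python
import itertools

INF = float("inf")


def all_pairs_distances(vertices, length):
    """Floyd-Warshall over directed edges; length maps (i, j) -> c(i, j) > 0."""
    d = {(i, j): (0.0 if i == j else length.get((i, j), INF))
         for i in vertices for j in vertices}
    # The first coordinate of product() varies slowest, so k is the outer loop.
    for k, i, j in itertools.product(vertices, repeat=3):
        if d[i, k] + d[k, j] < d[i, j]:
            d[i, j] = d[i, k] + d[k, j]
    return d


def local_regrets(vertices, length, R):
    """Return (local internal, local swap, local external) regret.

    R maps an edge (i, j) to the cumulative regret R^T_{i,j}; missing
    edges are treated as having zero regret.
    """
    pos = lambda x: max(x, 0.0)
    edges = list(length)
    out = {v: [j for (i, j) in edges if i == v] for v in vertices}
    D = max(len(nbrs) for nbrs in out.values())  # maximum out-degree
    d = all_pairs_distances(vertices, length)
    internal = max(pos(R.get(e, 0.0)) for e in edges)
    swap = sum(max((pos(R.get((a, b), 0.0)) for b in out[a]), default=0.0)
               for a in vertices)
    external = 0.0
    for b in vertices:
        # Edges lying on some shortest path to the target b.
        E_b = [(i, j) for (i, j) in edges
               if d[i, b] < INF and abs(d[i, b] - length[i, j] - d[j, b]) < 1e-9]
        external = max(external, pos(sum(R.get(e, 0.0) for e in E_b) / D))
    return internal, swap, external


# Two-bit hypercube with unit edge lengths and a few made-up regrets.
V = [(0, 0), (0, 1), (1, 0), (1, 1)]
length = {(a, b): 1.0 for a in V for b in V
          if sum(x != y for x, y in zip(a, b)) == 1}
R = {((0, 0), (0, 1)): 3.0, ((0, 0), (1, 0)): -1.0, ((0, 1), (1, 1)): 2.0}
internal, swap, external = local_regrets(V, length, R)
```

In this toy instance the maximizing target for local external regret is the vertex (1, 1), whose shortest-path edges carry a total regret of 4, scaled by the out-degree 2.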

It is easy to see that these concepts hold the same relationships between each other as their global counterparts. 

$$R^T_{\text{localinternal}} \le R^T_{\text{localswap}} \le |A|\, R^T_{\text{localinternal}} \tag{6}$$
$$R^T_{\text{localexternal}} \le R^T_{\text{localswap}} \tag{7}$$

More interestingly, in complete graphs where there is an edge between every pair of actions (all with unit lengths) and 
so everything is local, we can exactly equate global and local regret. 

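Theorem 1. If $G$ is a complete graph with unit edge lengths then,

$$R^T_{\text{localinternal}} = R^T_{\text{internal}}, \quad R^T_{\text{localswap}} = R^T_{\text{swap}}, \quad \text{and} \quad R^T_{\text{localexternal}} = R^T_{\text{external}} / D. \tag{8}$$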



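Proof.

$$R^T_{\text{localinternal}} = \max_{(a,b) \in E} R^{T,+}_{a,b} = \max_{a,b \in A} R^{T,+}_{a,b} = R^T_{\text{internal}} \tag{9}$$

$$R^T_{\text{localswap}} = \sum_{a \in A} \max_{b : (a,b) \in E} R^{T,+}_{a,b} = \sum_{a \in A} \max_{b \in A} R^{T,+}_{a,b} = R^T_{\text{swap}} \tag{10}$$

$$R^T_{\text{localexternal}} = \max_{b \in A} \Big( \sum_{(i,j) \in E_b} R^T_{i,j} / D \Big)^{+} \tag{11}$$

$$= \frac{1}{D} \max_{b \in A} \Big( \sum_{a \in A} R^T_{a,b} \Big)^{+} = R^T_{\text{external}} / D \tag{12}$$

□
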
So our concepts of local regret match up with global regret when the graph is complete. Of course, we are not really
interested in complete graphs, but rather more intricate locality structures with a large or infinite number of vertices, 
but a small out-degree. Before going on to present algorithms for minimizing local regret, we consider possible graphs 
for three different online decision tasks to illustrate where the graphs come from and what form they might take. 

Example 1 (Online Max-3SAT). Consider an online version of Max-3SAT. The task is to choose an assignment for
$n$ boolean variables: $A = \{0, 1\}^n$. After an assignment is chosen a clause is observed; the utility is 1 if the clause is
satisfied by the chosen assignment, 0 otherwise. Note that $|A| = 2^n$, which is computationally intractable for global
regret concepts if $n$ is even moderately large. One possible locality graph for this hypothesis class is the hypercube
with an edge from $a$ to $b$ if and only if $a$ and $b$ differ on the assignment of exactly one variable (see Figure 1(a)), and
all edges have unit lengths. So the out-degree $D$ for this graph is only $n$. Local regret, then, corresponds to the regret
for not having changed the assignment of just one variable. In essence, minimizing this concept of regret is the online
equivalent of local search (e.g., WalkSAT [Selman et al., 1993]) on the maximum satisfiability problem, an offline task
where all of the clauses are known up front.
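
A minimal sketch of the hypercube locality graph for this example follows. The clause encoding (a list of `(variable index, required value)` literals) and the function names are assumptions made for illustration. After one clause is observed, only the $n$ edges leaving the chosen assignment receive a regret increment.

```python
def neighbors(assignment):
    """Out-neighbors in the hypercube: flip exactly one variable."""
    return [assignment[:v] + (1 - assignment[v],) + assignment[v + 1:]
            for v in range(len(assignment))]


def clause_utility(assignment, clause):
    """Utility 1 if any literal (variable, required value) is satisfied, else 0."""
    return 1 if any(assignment[v] == val for v, val in clause) else 0


def edge_regret_increments(assignment, clause):
    """Map each neighbor b to u(b) - u(a) for the chosen assignment a."""
    u = clause_utility(assignment, clause)
    return {b: clause_utility(b, clause) - u for b in neighbors(assignment)}


# Clause x0 OR x1 OR (NOT x2), observed after playing the all-zero assignment.
clause = [(0, 1), (1, 1), (2, 0)]
increments = edge_regret_increments((0, 0, 0), clause)
```

Here the all-zero assignment already satisfies the clause, so flipping the third variable is the only move with a (negative) regret increment.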

Figure 1: Example graphs. (a) Graph for Max-3SAT and disjuncts (n = 3). (b) Part of graph for decision trees
(n = 2), where edges to and from the dashed boxes represent edges to and from every vertex in the box.

Example 2 (Online Disjunct Learning). Consider a boolean online classification task where input features are boolean
vectors $x \in \{0, 1\}^n$ and the target $y$ is also boolean. Consider $A = \{0, 1\}^n$ to be the set of all disjuncts such that
$a \in A$ corresponds to the disjunct $x_{i_1} \vee x_{i_2} \vee \ldots \vee x_{i_k}$ where $i_1, \ldots, i_k$ are all of the $k$ indices of $a$ such that $a_{i_j} = 1$.
In this online task, one must repeatedly choose a disjunct and then observe an instance which includes a feature vector
and the correct response. There is a utility of 1 if the chosen disjunct over the feature vector results in the correct
response; 0 otherwise. Although a very different task, the action space $A = \{0, 1\}^n$ is the same as with Online Max-SAT
and we can consider the same locality structure as that proposed for Max-3SAT: a hypercube with unit length edges
for adding or removing a single variable to the disjunction (see Figure 1(a)). And as before $|A| = 2^n$ while $D = n$.

Example 3 (Online Decision Tree Learning). Imagine the same boolean online classification task for learning disjuncts,
but the hypothesis class is the set of all possible decision trees. The number of possible decision trees for $n$
boolean variables is more than a staggering $2^{2^n}$, which for any practical purpose is infinite. We can construct a graph
structure that mimics the way decision trees are typically constructed offline, such as with C4.5 [Quinlan, 1993]. In
the graph $G$, add an edge from one decision tree to another if and only if the latter can be constructed by choosing
any node (internal or leaf) of the former and replacing the subtree rooted at the node with a decision stump or a label.
There is one exception: you cannot replace a non-leaf subtree with a stump splitting on the same variable as that of
the root of the subtree. See Figure 1(b) for a portion of the graph. Edges that replace a subtree with a label have
length 1, while edges replacing a subtree with a stump (being a more complex change) have length 1.1. So, we
have local regret for not having further refined a leaf or collapsing a subtree to a simpler stump or leaf. Notice that
the graph edges in this case are not all symmetric (viz., collapsing edges). In essence, this is the online equivalent of
tree splitting algorithms. While $|A| > 2^{2^n}$, the out-degree is no more than $(n + 1)2^{n+1}$. The maximum size of the
out-degree still appears disconcertingly large, and we will return to this issue in Section 5 where we show how we can
exploit the graph structure to further simplify learning.

4 An Algorithm for Local Swap Regret 

We now present an algorithm for minimizing local swap regret, similar to global swap regret algorithms [Hart and 
Mas-Colell, 2002; Greenwald and Jafari, 2003], but with substantial differences. The algorithm essentially chooses 
actions according to the stationary distribution of a Markov process on the graph, with the transition probabilities on 
the edges being proportional to the accumulated regrets. However there are two caveats that are needed for it to handle 
infinite graphs: it is prevented from playing beyond a particular distance from a designated root vertex, and there is an 
internal bias towards the actual actions chosen. 

Formally, let root be some designated vertex. Define $d_1$ to be the unweighted shortest path distance between two
vertices. Define the level of a vertex as its distance from root: $\mathcal{L}(v) = d_1(\text{root}, v)$. Note that $\mathcal{L}(\text{root}) = 0$, and
$\forall (i,j) \in E$, $\mathcal{L}(j) \le \mathcal{L}(i) + 1$. All of the algorithms in this paper take a parameter $L$, and will never choose actions at
a level greater than $L$. In addition, the algorithms all maintain values $\tilde{R}^t_{i,j}$ (which are biased versions of $R^t_{i,j}$) and use
these to compute $\pi^t_j$, the probability of choosing action $j$ at time $t$. These probabilities are always computed according
to the following requirement, which is a generalization of [Hart and Mas-Colell, 2002; Greenwald and Jafari, 2003].

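Requirement 2. Given a parameter $L$, for all $t \le T$, and some $\tilde{R}^{t,+}$, let $\pi^{t+1}$ be such that

(a) $\sum_{j \in V} \pi^{t+1}_j = 1$, and $\forall j \in V$, $\pi^{t+1}_j \ge 0$.

(b) $\forall j \in V$ such that $\mathcal{L}(j) > L$, $\pi^{t+1}_j = 0$.

(c) $\forall j \in V$ such that $1 \le \mathcal{L}(j) \le L$, $\pi^{t+1}_j = \sum_{i:(i,j) \in E} (\tilde{R}^{t,+}_{i,j}/M)\, \pi^{t+1}_i + \big(1 - \sum_{k:(j,k) \in E} \tilde{R}^{t,+}_{j,k}/M\big)\, \pi^{t+1}_j$.

(d) $\pi^{t+1}_{\text{root}} = \sum_{i:(i,\text{root}) \in E} (\tilde{R}^{t,+}_{i,\text{root}}/M)\, \pi^{t+1}_i + \sum_{j:\mathcal{L}(j)=L+1} \sum_{i:(i,j) \in E} (\tilde{R}^{t,+}_{i,j}/M)\, \pi^{t+1}_i + \big(1 - \sum_{j:(\text{root},j) \in E} \tilde{R}^{t,+}_{\text{root},j}/M\big)\, \pi^{t+1}_{\text{root}}$.

(e) If there exists $j \in V$ such that $\pi^{t+1}_j > 0$ and $\sum_{k:(j,k) \in E} \tilde{R}^{t,+}_{j,k} = 0$, then for all $j \in V$ where $\pi^{t+1}_j > 0$,
$\sum_{k:(j,k) \in E} \tilde{R}^{t,+}_{j,k} = 0$, and we call such a $\pi^{t+1}$ degenerate.

where $M = \max_{(i,j) \in E} \tilde{R}^{t,+}_{i,j}$. These conditions require $\pi^{t+1}$ to be the stationary distribution of the transition
function whose probabilities on outgoing edges are proportional to their biased positive regret, with the root vertex as
the starting state, and all outgoing transitions from vertices in level $L$ going to the root vertex instead.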

Definition 2. $(b, L)$-regret matching is the algorithm that initializes $\tilde{R}^0_{i,j} = 0$, chooses actions at time $t$ according to
a distribution $\pi^t$ that satisfies Requirement 2 and after choosing action $i$ and observing $u^t$ updates $\tilde{R}^t_{i,j} = \tilde{R}^{t-1}_{i,j} + (u^t(j) - u^t(i) - b)$
for all $j$ where $(i,j) \in E$, and for all other $(k,l) \in E$ where $k \ne i$, $\tilde{R}^t_{k,l} = \tilde{R}^{t-1}_{k,l}$.

There are two distinguishing factors of our algorithm from [Hart and Mas-Colell, 2002; Greenwald and Jafari,
2003]: $\tilde{R} \ne R$, and past a certain distance from the root, we loop back. $\tilde{R}$ differs from $R$ by the bias term, $b$. This
term can be thought of as a bias toward the action selected by the algorithm. This is not the same as approaching the
negative orthant with a margin for error. This small amount is only applied to the action taken, which is very different
from adding a small margin of error to every edge.
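
The following is a rough sketch of one round of regret matching on a small finite graph, under stated assumptions. The function names are hypothetical. The stationary distribution is approximated by power iteration started at the root, and the normalizer is twice the largest row sum of positive biased regret (rather than the $M$ of Requirement 2) so that every row is a valid lazy transition distribution; rescaling the self-loop probabilities in this way does not change the stationary distribution. The all-zero-regret case is handled by simply playing the root, which is a simplification of condition (e).

```python
from collections import deque


def levels(out_edges, root):
    """Unweighted BFS distance from root, i.e., the level of each vertex."""
    level = {root: 0}
    queue = deque([root])
    while queue:
        i = queue.popleft()
        for j in out_edges.get(i, []):
            if j not in level:
                level[j] = level[i] + 1
                queue.append(j)
    return level


def play_distribution(out_edges, root, L, R_tilde, iters=5000):
    """Approximate the distribution of Requirement 2 by power iteration."""
    level = levels(out_edges, root)
    active = [v for v in level if level[v] <= L]
    pos = {(i, j): max(R_tilde.get((i, j), 0.0), 0.0)
           for i in active for j in out_edges.get(i, [])}
    row = {i: sum(pos[i, j] for j in out_edges.get(i, [])) for i in active}
    scale = 2.0 * max(row.values())  # assumed normalizer, keeps the chain lazy
    if scale == 0.0:
        return {root: 1.0}  # simplification: no positive regret anywhere
    pi = {v: 0.0 for v in active}
    pi[root] = 1.0
    for _ in range(iters):
        nxt = {v: pi[v] * (1.0 - row[v] / scale) for v in active}
        for (i, j), r in pos.items():
            # Transitions into level L + 1 are redirected to the root.
            target = j if level[j] <= L else root
            nxt[target] += pi[i] * r / scale
        pi = nxt
    return pi


def biased_update(R_tilde, out_edges, i, u, b):
    """Definition 2 update after playing i and observing utility function u."""
    for j in out_edges.get(i, []):
        R_tilde[i, j] = R_tilde.get((i, j), 0.0) + (u(j) - u(i) - b)


# Directed 3-cycle rooted at vertex 0 with made-up biased regrets.
out_edges = {0: [1], 1: [2], 2: [0]}
R = {(0, 1): 2.0, (1, 2): 1.0, (2, 0): 1.0}
pi_full = play_distribution(out_edges, 0, 2, R)
pi_cut = play_distribution(out_edges, 0, 1, R)
```

With $L = 2$ every vertex is playable and the flow out of each vertex balances, giving probabilities proportional to (1/2, 1, 1). With $L = 1$ vertex 2 is never played and the edge into it loops back to the root.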

Theorem 3. For any directed graph with maximum out-degree $D$ and any designated vertex root, $(\Delta/(L + 1), L)$-regret
matching, after $T$ steps, will have expected local swap regret no worse than,

where $E_L = \{(i,j) \in E \mid \mathcal{L}(i) \le L\}$.

The proof can be found in Appendix A. The overall structure of the proof is similar to [Blackwell, 1956; Hart 
and Mas-Colell, 2002; Greenwald and Jafari, 2003] with a few significant changes. As with most algorithms based on 
Blackwell, if there is an action you do not regret taking, playing that action the next round is "safe". If not, the key 
quantity in the proof is a flow $f_{i,j} = \pi^{t+1}_i \tilde{R}^{t,+}_{i,j}$ for each edge. On most of the graph, the incoming flow is equal to the
outgoing flow for each node in levels 1 to L. Since all the flow out from the nodes on one level is equal to the flow 
into the next, the total flow into (and out of) each level is equal. Thus, the flow out of the last level is only 1/(L + 1) 
of the total flow on all edges since there are L + 1 levels, including the root. 

Traditionally, we wish to show that the incoming flow of an action times the utility minus the outgoing flow of 
an action times the utility summed over all nodes is nonpositive, and then Blackwell's condition holds. In traditional 
proofs, for any given node, the flow in and out are equal, so regardless of the utility, they cancel. For our problem, 
the flow out of the last level is really a flow into the $(L + 1)$st level, not the zeroth level, so the difference in utilities
between the zeroth level and the $(L + 1)$st level creates a problem. On the other hand, because we subtract $b$ from
whatever action we select, we get to subtract $b$ times the total flow. Since exactly a $1/(L + 1)$ fraction of the flow is
going into the $(L + 1)$st level, these two discrepancies from the traditional approach exactly cancel. The second term
of Equation (13) is a result of the traditional Blackwell approach. In the final analysis, we must account for the amount
$b$ we subtract from the regret each round. This means that if we get $\tilde{R}$ to approach the negative orthant, we only have
$bT$ local swap regret left. This is the first term of Equation (13).

5 Exploiting Locality Structure



The local swap regret algorithm in the previous section successfully drops all dependence on the size of the action 
set and thus can be applied even for infinite action sets. However, the appearance of $|E_L|$ in the bound in Theorem 3
is undesirable as $|E_L| \in O(D^L)$, and $L$ is more likely to be 100 than 2, in order to keep the first term of the bound
low. The bound, therefore, practically provides little beyond an asymptotic guarantee for even the simplest setting of 
Example 1. In this section, we will appeal to (i) the structure in the locality graph, and (ii) local external regret to 
achieve a more practical regret bound and algorithm. 



5.1 Cartesian Product Graphs 

We begin by considering the case of G having a very strong structure, where it can be entirely decomposed into a set 
of product graphs. In this case, we can show that by independently minimizing local regret in the product graphs we 
can minimize local regret in the full graph. 

Theorem 4. Let $G$ be a Cartesian product of graphs, $G = G_1 \otimes \ldots \otimes G_k$ where $G_l = (V_l, E_l)$. For all $l \in \{1, \ldots, k\}$,
define $u^t_l : V_l \to \mathbb{R}$, such that $u^t_l(a_l) = u^t((a^t_1, \ldots, a^t_{l-1}, a_l, a^t_{l+1}, \ldots, a^t_k))$, so $u^t_l$ is a utility function on the $l$th
component of the action at time $t$ assuming the other components remain unchanged. Let $E[l] \subseteq E$ be the set of edges
that change only on the $l$th component, so $\{E[l]\}_{l=1,\ldots,k}$ forms a partition of $E$. Let $D_l \le D$ be the maximum degree
of $G_l$. Finally, define

$$R^{T,l}_{\text{localexternal}} = \max_{b_l \in V_l} \Big( \sum_{(i,j) \in E^{b_l}_l} \sum_{t=1}^{T} \mathbf{1}(a^t_l = i)\,(u^t_l(j) - u^t_l(i)) / D_l \Big)^{+},$$

where $E^{b_l}_l = \{(i,j) \in E_l : d(i, b_l) = c(i,j) + d(j, b_l)\}$, i.e., it contains the edges that move the $l$th component
closer to $b_l$. Then, $R^T_{\text{localexternal}} \le \sum_{l=1}^{k} R^{T,l}_{\text{localexternal}}$.

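Proof.

$$R^T_{\text{localexternal}} = \max_{b \in V} \Big( \sum_{(i,j) \in E_b} R^T_{i,j} / D \Big)^{+} \tag{14}$$

$$= \max_{b \in V} \Big( \sum_{l=1}^{k} \sum_{(i,j) \in E[l] \cap E_b} R^T_{i,j} / D \Big)^{+} \tag{15}$$

$$\le \max_{b \in V} \sum_{l=1}^{k} \Big( \sum_{(i,j) \in E[l] \cap E_b} R^T_{i,j} / D \Big)^{+} \tag{16}$$

Since $D_l \le D$,

$$\le \max_{b \in V} \sum_{l=1}^{k} \Big( \sum_{(i,j) \in E[l] \cap E_b} R^T_{i,j} / D_l \Big)^{+} \tag{17}$$

$$= \max_{b \in V} \sum_{l=1}^{k} \Big( \sum_{(i,j) \in E[l] \cap E_b} \sum_{t=1}^{T} \mathbf{1}(a^t = i)\,(u^t(j) - u^t(i)) / D_l \Big)^{+} \tag{18}$$

$$= \max_{b \in V} \sum_{l=1}^{k} \Big( \sum_{(i_l, j_l) \in E^{b_l}_l} \sum_{t=1}^{T} \mathbf{1}(a^t_l = i_l)\,(u^t_l(j_l) - u^t_l(i_l)) / D_l \Big)^{+} \tag{19}$$

$$\le \sum_{l=1}^{k} R^{T,l}_{\text{localexternal}} \tag{20}$$

□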

The implication is that if we apply independent regret minimization to each factor of our product graph, we
can minimize local external regret on the full graph. For example, consider the hypercube graphs from Examples 1
and 2. By applying $n$ independent external regret algorithms (the component graphs in this case are 2-vertex complete
graphs), the overall local external regret for the graph is at most $n$ times bigger than the factors' regrets, so under regret
matching it is bounded by $n$ times the bound for a single factor. Hence, we are able to handle an exponentially large
graph (in $n$) with local external regret only growing linearly (in $n$). If the component graphs are not complete graphs,
then we can simply apply our local swap regret algorithm from the previous section to the graph factors, which minimizes
local external regret as well.
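
To make the factorization concrete for the hypercube, the sketch below runs one two-action regret-matching learner per variable on a stream of clauses. It is an illustration under assumptions, not the experimental code: the function names, the clause encoding as `(variable index, required value)` literals, and the uniform choice when no regret is positive are all choices made for this example.

```python
import random


def clause_utility(assignment, clause):
    """Utility 1 if any literal (variable, required value) is satisfied, else 0."""
    return 1 if any(assignment[v] == val for v, val in clause) else 0


def run_bitwise_regret_matching(n, clauses, T, seed=0):
    """Run n independent two-action regret matchers, one per variable."""
    rng = random.Random(seed)
    # regret[l][x]: cumulative gain component l would have had from playing
    # value x instead of its actual value, holding other components fixed.
    regret = [[0.0, 0.0] for _ in range(n)]
    satisfied = 0
    for _ in range(T):
        a = []
        for l in range(n):
            p0, p1 = max(regret[l][0], 0.0), max(regret[l][1], 0.0)
            prob1 = 0.5 if p0 + p1 == 0.0 else p1 / (p0 + p1)
            a.append(1 if rng.random() < prob1 else 0)
        clause = rng.choice(clauses)
        u = clause_utility(a, clause)
        satisfied += u
        for l in range(n):
            for x in (0, 1):
                alt = a[:l] + [x] + a[l + 1:]  # change only component l
                regret[l][x] += clause_utility(alt, clause) - u
    return satisfied / T, regret


# A single repeated clause requiring x0 = 1.
fraction, regret = run_bitwise_regret_matching(n=3, clauses=[[(0, 1)]], T=200, seed=1)
```

In this toy stream the learner for the first variable can set it to 0 at most once: after that round its regret for value 1 is positive and its regret for value 0 is not, so it plays 1 from then on.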



5.2 Color Regret 

Cartesian product graphs are a powerful, but not very general structure. We now substantially generalize the product 
graph structure, which will allow us to achieve a similar simplification for very general graphs, such as the graph on 
decision trees in Example 3. The key insight of product graphs is that for any vertex $b$, an edge moves toward $b$ if
and only if its corresponding edge in its component graph moves toward $b_l$. In other words, either all of the edges
that correspond to some component edge will be included in the external regret sum, or none of the edges will. We can
group together these edges and only worry about the regret of the group and not its constituents. We generalize this 
fact to graphs which do not have a product structure. 

Definition 5. An edge-coloring $\mathcal{C} = \{C_i\}_{i=1,2,\ldots}$ for an arbitrary graph $G$ with edge lengths is a partition of $E$:
$C_i \subseteq E$, $\bigcup_i C_i = E$, and $C_i \cap C_j = \emptyset$ for $i \ne j$. We say that $\mathcal{C}$ is admissible if and only if for all $b \in V$, $C \in \mathcal{C}$, and
$(i,j), (i',j') \in C$, $d(i,b) = c(i,j) + d(j,b) \Leftrightarrow d(i',b) = c(i',j') + d(j',b)$. In other words, for any arbitrary target,
all of the edges with the same color are on a shortest path, or none of the edges are.

We now consider treating all of the edges of the same color as a single entity for regret. This gives us the notion of 
local colored regret. 

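$$R^T_{\text{localcolor}} = \sum_{C \in \mathcal{C}} \Big( \sum_{(i,j) \in C} R^T_{i,j} \Big)^{+} \tag{21}$$

Theorem 6. If $\mathcal{C}$ is admissible then $R^T_{\text{localexternal}} \le R^T_{\text{localcolor}} / D$.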
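Proof.

$$R^T_{\text{localexternal}} = \max_{b \in V} \Big( \sum_{C \in \mathcal{C}} \sum_{(i,j) \in C \cap E_b} R^T_{i,j} / D \Big)^{+} \tag{22}$$

For a particular target $b$ let $\mathcal{C}_b = \{C \in \mathcal{C} : C \subseteq E_b\}$, i.e., $\mathcal{C}_b$ is the set of colors that reduce the distance to $b$. Then
by $\mathcal{C}$'s admissibility,

$$R^T_{\text{localexternal}} = \max_{b \in V} \Big( \sum_{C \in \mathcal{C}_b} \sum_{(i,j) \in C} R^T_{i,j} / D \Big)^{+} \tag{24}$$

$$\le \max_{b \in V} \sum_{C \in \mathcal{C}_b} \Big( \sum_{(i,j) \in C} R^T_{i,j} / D \Big)^{+} \tag{25}$$

$$\le \sum_{C \in \mathcal{C}} \Big( \sum_{(i,j) \in C} R^T_{i,j} / D \Big)^{+} \tag{26}$$

$$= R^T_{\text{localcolor}} / D \tag{27}$$

□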

So by minimizing local colored regret, we minimize local external regret. The natural extension of our local swap 
regret algorithm from the previous section results in an algorithm that can minimize local colored regret. 

Definition 7. $(b, L, \mathcal{C})$-colored-regret-matching is the algorithm that initializes $\tilde{R}^0_C = 0$ for all $C \in \mathcal{C}$, chooses
actions at time $t$ according to a distribution $\pi^t$ that satisfies Requirement 2 with $\tilde{R}^t_{i,j} = \tilde{R}^t_C$ for the color $C$ containing $(i,j)$,
and after choosing action $i$ and observing $u^t$ at time $t$, for all $C \in \mathcal{C}$ updates
$\tilde{R}^t_C = \tilde{R}^{t-1}_C + \sum_{j:(i,j) \in C} (u^t(j) - u^t(i) - b)$.
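
The colored update can be sketched as follows. The names `color_of` (a dictionary from edges to colors), `colored_update`, and `edge_regret` are assumptions of this example. The point of the sketch is the sharing: one observation at a vertex updates a color, and every edge of that color, including edges leaving vertices never played, then reports the same biased regret.

```python
def colored_update(R_color, color_of, out_edges, i, u, b):
    """After playing i and observing u, update the regret of each color
    that has an edge leaving i (a sketch of the Definition 7 update)."""
    for j in out_edges[i]:
        C = color_of[i, j]
        R_color[C] = R_color.get(C, 0.0) + (u(j) - u(i) - b)


def edge_regret(R_color, color_of, edge):
    """Biased regret used for an edge: the regret of its color."""
    return R_color.get(color_of[edge], 0.0)


# Two-bit hypercube; the color of an edge is (flipped variable, new value).
V = [(0, 0), (0, 1), (1, 0), (1, 1)]
out_edges = {a: [a[:v] + (1 - a[v],) + a[v + 1:] for v in range(2)] for a in V}
color_of = {(a, b): next((v, b[v]) for v in range(2) if a[v] != b[v])
            for a in V for b in out_edges[a]}

R_color = {}
# Utility is 1 exactly when the first bit is set; bias b = 0.25.
colored_update(R_color, color_of, out_edges, (0, 0), lambda a: float(a[0]), 0.25)
shared = edge_regret(R_color, color_of, ((0, 1), (1, 1)))
```

After a single round at (0, 0), the edge from (0, 1) to (1, 1) already carries the regret for turning the first variable on, although (0, 1) was never played.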

Theorem 8. For an arbitrary graph $G$ with maximum degree $D$, arbitrarily chosen vertex root, and edge coloring
$\mathcal{C}$, $(\Delta/(L + 1), L, \mathcal{C})$-colored-regret matching applied after $T$ steps will have expected local colored regret no worse
than,

1 F\R T 1< AD + A V / ^[ 



where $\mathcal{C}_L = \{C \in \mathcal{C} \mid \exists (i,j) \in C \text{ s.t. } \mathcal{L}(i) \le L\}$.

The proof is in Appendix B. The consequence of this bound depends upon the number of colors needed for an
admissible coloring. Very small admissible colorings are often possible. The hypercube graph needs only $2n$ colors
to give an admissible coloring, which is exponentially smaller than the total number of edges, $n 2^n$. We can also find a
reasonably tight coloring for our decision tree graph example, despite being a complex asymmetric graph.
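
The hypercube claim can be checked by brute force for small $n$. The sketch below builds the coloring with one color per (variable, new value) pair and tests the admissibility condition of Definition 5 for every target. It relies on the fact that, with unit edge lengths, the shortest-path distance in the hypercube is the Hamming distance; the function names are assumptions of this example.

```python
from itertools import product


def hamming(a, b):
    """Shortest-path distance in the unit-length hypercube."""
    return sum(x != y for x, y in zip(a, b))


def hypercube_coloring(n):
    """Color each edge by (flipped variable, value after the flip)."""
    color_of = {}
    for a in product((0, 1), repeat=n):
        for v in range(n):
            b = a[:v] + (1 - a[v],) + a[v + 1:]
            color_of[a, b] = (v, b[v])
    return color_of


def is_admissible(n, color_of):
    """For every target, edges of one color must agree on whether they lie
    on a shortest path to that target."""
    for target in product((0, 1), repeat=n):
        status = {}
        for (i, j), color in color_of.items():
            on_path = hamming(i, target) == 1 + hamming(j, target)
            if status.setdefault(color, on_path) != on_path:
                return False
    return True


coloring = hypercube_coloring(3)
num_edges = len(coloring)
num_colors = len(set(coloring.values()))
admissible = is_admissible(3, coloring)
```

For $n = 3$ this gives 24 directed edges and 6 colors. By contrast, putting all edges in a single color fails the check, since an edge and its reverse never both move toward the same target.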

Example 4 (Colored Decision Tree Learning). Reconsider Example 3 and the graph in Figure 1(b). Recall that an 
edge exists between one decision tree and another if the latter can be constructed from the former by replacing a 
subtree at any node (internal or leaf) with a label (edge length 1) or a stump (edge length 1.1). We will color this 
edge with the pair: (i) the sequence of variable assignments that is required to reach the node being replaced, and 
(ii) the stump or label that replaces it. This coloring is admissible. We can see this fact by considering a color: the 
sequence of variable assignments and resulting stump or label. If this color is consistent with the target decision tree 
(i.e., the sequence exists in the target decision tree, and the variable of the added stump matches the variable split on 
at that point in the target decision tree) then the color must move you closer to the target tree. A formal proof of its 
admissibility is very involved and can be found in Appendix C. 



6 Experimental Results 

The previous section presented algorithms that minimize local swap and local external regret (by minimizing local 
colored regret). The regret bounds have no dependence on the size of the graph beyond the graph's degree, and so 
provide a guarantee even for infinite graphs. We now explore these algorithms' practicality as well as illustrate the 
generality of the concepts by applying them to a diverse set of online problems. The first two tasks we examine, online 
Max-3SAT and online decision tree learning, have not previously been explored in the online setting. The final task, 
online disjunct learning, has been explored previously, and will help illustrate some drawbacks of local regret. 

In all three domains we examine two algorithms. The first minimizes local swap regret by applying $(\Delta/(L+1), L)$-regret
matching with $L$ chosen specifically for the problem. This will be labeled "Local Swap". The second focuses
on local external regret by using a tight, admissible edge-coloring and applying $(\Delta/(L+1), L, \mathcal{C})$-colored-regret
matching. This will be labeled simply "Local External".

6.1 Online Max-3SAT 

First, we consider Example 1. We randomly constructed problem instances with n = 20 boolean variables and 201
clauses each with 3 literals. On each timestep, the algorithms selected an assignment of the variables, a clause was
chosen at random from the set, and the algorithm received a utility of 1 if the assignment satisfied the clause, 0
otherwise. This was repeated for 1000 timesteps. The locality graph used was the n-dimensional hypercube from
Example 1. The admissible coloring used to minimize local external regret was the 2n coloring that has two colors per
variable (one for turning the variable on, and one for turning the variable off). In both cases we set $L = \infty$ and $b = 0$,
since the bounds do not depend on $L$ once it exceeds 20. This also achieved the best performance for both algorithms.
The average results over 200 randomly constructed sets of clauses are shown in Figure 2, with 95% confidence bars.

Figure 2 (a) shows the time-averaged colored regret of the two algorithms, to demonstrate how well the algorithms 
are actually minimizing regret. Both are decreasing over time, while external regret is decreasing much more rapidly. 
As expected, swap regret may be a stronger concept, but it is more difficult to minimize. The local external regret 
algorithm after only one time step can have regret for not having made a particular variable assignment, while local 
swap regret has to observe regret for this assignment from every possible assignment of the other variables to achieve 
the same result. This is further demonstrated by the number of regret values each algorithm is tracking: local external 
regret on average had 34 non-zero regret values, while local swap regret had 4200 non-zero regret values. In summary, 
external regret provides a powerful form of generalization. Figure 2 (b) shows the fraction of the previous 100 clauses 
that were satisfied. Two baselines are also presented. A random choice of variable assignments can satisfy 7/8 of the
clauses in expectation. We also ran WalkSAT [Selman et al., 1993] offline on the set of 201 clauses, and on average it
was able to satisfy all but 4% of the clauses, which gives an offline lower bound for what is possible. Both substantially
outperformed random, with the external regret algorithm nearing the performance of the offline WalkSAT.
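
The random baseline can be verified by enumeration: a clause over three distinct variables is falsified by exactly one of the eight assignments to those variables, so a uniformly random assignment satisfies it with probability 7/8. The short sketch below checks this; the clause encoding as `(variable index, required value)` literals and the function name are assumptions of this example.

```python
from fractions import Fraction
from itertools import product


def satisfied_fraction(clause):
    """Exact fraction of assignments to the clause's variables that satisfy it."""
    variables = sorted({v for v, _ in clause})
    total = sat = 0
    for values in product((0, 1), repeat=len(variables)):
        assignment = dict(zip(variables, values))
        total += 1
        sat += any(assignment[v] == val for v, val in clause)
    return Fraction(sat, total)


baseline = satisfied_fraction([(0, 1), (3, 0), (7, 1)])
```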

6.2 Online Decision Tree Learning 

Second, we consider Example 3. We took three datasets from the UCI Machine Learning Repository (each with
categorical inputs and a large number of instances): nursery, mushroom, and king-rook versus king-pawn [Frank and
Asuncion, 2010]. The categorical attributes were transformed into boolean attributes (which simplified the
implementation of the locality graphs) by having a separate boolean feature for each attribute value.$^1$ We made the problems
online classification tasks by sampling five instances at random (with replacement) for each timestep, with the utility 
being the number classified correctly by the algorithm's chosen decision tree. This was repeated for 1000 timesteps, 
and so the algorithms classified 5,000 instances in total. The locality graph used was the one described in Example 3. 
The tight coloring used to minimize local external regret was the one described in Example 4. L was set to 3 for local 
swap regret, and 100 for local external regret, as this achieved the best performance. Even with the far larger graph, 
the external regret algorithm was observing nearly one-eighth of the number of non-zero regret values observed by the 
local swap algorithm. The average results over 50 trials are shown in Figure 3(a)-(c) with 95% confidence bars. 

The graphs show the average fraction of misclassified instances over the previous 100 timesteps. Two baselines 
are also plotted: the best single label (i.e., the size of the majority class) and the best decision stump. Both regret 
algorithms substantially improved on the best label, and local external regret was selecting trees substantially better 
than the best stump. As a further baseline, we ran the batch algorithm C4.5 in an online fashion, by retraining a decision 
tree after each timestep using all previously observed examples. C4.5's performance was impressive, learning highly 
accurate trees after observing only a small fraction of the data. However, C4.5 has no regret guarantees. As with 

1 As a result, there were n = 28 features for nursery, 118 features for mushroom, and 74 features for king-rook versus king-pawn.
Figure 3: Results for online decision tree learning on three UCI datasets: (a) Nursery, (b) Mushroom, (c) King-Rook/King-Pawn;
and (d) a simple sequence of alternating labels.



any offline algorithm used in an online fashion, there is an implicit assumption that the past and future data instances
are i.i.d. In our experimental setup, the instances were i.i.d., and as a result C4.5 performed very well. To further
illustrate this point, we constructed a simple online classification task where instances with identical attributes were
provided with alternating labels. The best label (as well as the single best decision tree) has a 50% accuracy. C4.5,
when trained on the previously observed instances, misclassifies every single instance. This is shown along with the local
regret algorithms in Figure 3(d).
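The alternating-label failure can be reproduced with a small sketch. It does not run C4.5. Since all instances share identical attributes, a majority-label predictor stands in for the tree a batch learner would fit. The tie-breaking rule (prefer the most recent label, default to false with no data) and the function names are assumptions made for illustration; other tie-breaking rules would give a different count on the tied steps.

```python
from collections import Counter

def majority_with_recency_tiebreak(history, default=False):
    """Predict the most frequent past label; break ties toward the latest label (assumed rule)."""
    if not history:
        return default
    counts = Counter(history)
    if counts[True] == counts[False]:
        return history[-1]
    return counts[True] > counts[False]

def run_alternating(num_steps):
    """Feed labels True, False, True, ... and count the batch-style learner's mistakes."""
    history, mistakes = [], 0
    for t in range(num_steps):
        label = (t % 2 == 0)
        prediction = majority_with_recency_tiebreak(history)
        mistakes += int(prediction != label)
        history.append(label)
    return mistakes

batch_mistakes = run_alternating(1000)
# A fixed label is wrong on exactly half of the alternating sequence.
constant_mistakes = sum(1 for t in range(1000) if (t % 2 == 0) is not True)
```

Under these assumptions the retrained predictor is wrong at every step, while any fixed label is wrong only half of the time, which is the gap the experiment is meant to expose.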

6.3 Online Disjunct Learning 

Finally, we examine online disjunct learning as described in Example 2. This task has received considerable attention, 
notably the celebrated Winnow algorithm [Littlestone, 1988], which is guaranteed to make a finite number of mistakes 
if the instances can be perfectly classified by some disjunction. Furthermore, the number of mistakes Winnow2 makes, 
when no disjunction captures the instances, can be bounded by the number of attribute errors (i.e., the number of input 
attributes that must be flipped to make the disjunction satisfy the instance) made by the best disjunction. In these 
experiments we compare our algorithms' performance to that of Winnow2. 

We looked at two learning tasks. In the first, we generated a random disjunction over n = 20 boolean variables, 
where a variable was independently included in the disjunction with probability 4/n. Instances were created with 
uniform random assignments to all of the variables, with a label being true if and only if the chosen disjunct is true 
for the instance's assignment. In the second case, we chose instances uniformly at random from a constructed set of 
Figure 4: Results for online disjunct learning: (a) random disjunct, (b) Winnow Killer.



21 instances: one for each variable with that variable (only) set to true and the label being true, and one with all of 
the variables assigned the value of true and the label being false. We call this task Winnow Killer. For both tasks, 
the n-dimensional hypercube from Example 1 was used as the locality graph with the In coloring as our admissible 
coloring, and $L = \infty$ and $b = 0$. The average results over 50 trials are shown in Figure 4, with 95% confidence bars.
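The two tasks can be sketched as follows. This is an illustrative reconstruction of the data generation only, not the experimental code; the function names and the representation of a disjunction as a set of variable indices are assumptions.

```python
import random

def random_disjunction(n, rng):
    """Include each variable independently with probability 4/n."""
    return {i for i in range(n) if rng.random() < 4.0 / n}

def evaluate(disjunction, assignment):
    """A disjunction is true iff at least one of its variables is true."""
    return any(assignment[i] for i in disjunction)

def random_instance(n, target, rng):
    """Uniform random assignment, labeled by the target disjunction."""
    assignment = [rng.random() < 0.5 for _ in range(n)]
    return assignment, evaluate(target, assignment)

def winnow_killer_instances(n):
    """n single-variable instances labeled true, plus the all-true instance labeled false."""
    instances = [([j == i for j in range(n)], True) for i in range(n)]
    instances.append(([True] * n, False))
    return instances

def mistakes(disjunction, instances):
    """Number of instances the disjunction labels incorrectly."""
    return sum(evaluate(disjunction, a) != label for a, label in instances)

n = 20
killer = winnow_killer_instances(n)
all_vars_mistakes = mistakes(set(range(n)), killer)  # only the all-true instance is wrong
empty_mistakes = mistakes(set(), killer)             # every single-variable instance is wrong
```

On the constructed set, the disjunction over all variables errs only on the all-true instance, so no disjunction can be perfect there, which is why the best-disjunct baseline is nonzero on this task.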

The graphs plot error rates over the previous 100 instances. Three baselines are plotted: randomly assigning a 
label (guaranteed to get half of the instances correct on expectation), the best disjunct (which makes no mistakes for 
random disjunctions and makes a 1/21 fraction of mistakes on the Winnow Killer task), and Winnow2. Figure 4(a) shows the results
on random disjunctions. Winnow2 is guaranteed to make a finite number of mistakes and indeed its error rate drops to 
zero quickly. The local regret concepts, though, have difficulties with random disjunctions. The reason can be easily 
seen for the case of local external regret. Suppose the first instance is labeled true; the algorithm now has regret for 
all of the variables that were true in that instance (some of these will be in the target disjunction, but many will not). 
These variables will now be included in the chosen disjunction for a very long time, as the only regret that one can have 
for not removing them is if their assignment was the sole reason for misclassifying a false instance. In other words, 
the problem is that there's no regret for not removing multiple variables simultaneously as this is not a local change. 
Winnow2, though, also has issues. It performs very poorly in the Winnow Killer task (in fact, if the instances were 
ordered it could be made to get every instance wrong), as shown in Figure 4 (b). Since the mistake bound for Winnow2 
is with respect to the number of attribute errors, a single mistake by the best disjunction can result in n mistakes by 
Winnow2. A further issue with Winnow is that while its performance is tied to the performance of disjunctions, its own
hypothesis class is not disjunctions but a thresholded linear function, whereas local regret is playing in the same class
of hypotheses that it is comparing against.



7 Conclusion 

We introduced a new family of regret concepts based on restricting regret to only nearby hypotheses using a locality
graph. We then presented algorithms for minimizing these concepts, even when the number of hypotheses is infinite.
Further, we showed that we can exploit structure in the graph to achieve tighter bounds and better performance.
These new regret concepts mimic local search methods, which are common approaches to offline optimization with
intractably hard hypothesis spaces. As such, our concepts and algorithms allow us to make online guarantees, with a
similar flavor to their offline counterparts, for these hypothesis spaces.

There are a number of interesting directions for future work as well as open problems. Admissible colorings can result
in radically improved bounds as well as empirical performance. How can such admissible colorings be constructed
for general graphs? What graph structures lead to exponentially small admissible colorings compared to the size of the
graph? We can easily construct the minimum admissible coloring for graphs that are recursively constructed as Cartesian
products of graphs and complete graphs. While such graphs can have exponentially small admissible colorings,
they form a very narrow class of structures. What other structures lead to exponentially small admissible colorings?
Furthermore, edge lengths can have a significant impact on the size of the minimum admissible coloring. For example,
the decision tree graph from Example 3 was carefully constructed to result in a tight coloring, and, in fact, unit length
edges over the same graph would result in an exponentially larger admissible coloring. How can edge lengths be
defined to allow for small minimum colorings?
A Proof for Local Swap Regret



At its heart, the Hart and Mas-Colell proof for minimizing internal regret relies on the relationship between Markov
chains and flows. The Blackwell condition is (roughly speaking) that the probability flow into an action equals the
probability flow out of an action. In the variant here, there are two ways to view this flow. Define $f$ such that for all
$(i,j) \in E$, $f_{i,j} = \pi_i^{t+1} R_{i,j}^{t,+}$. Implicitly, $f$ depends on the time $t$, but we suppress this as we always refer to a time $t$.
This flow $f$ is similar to the flows in Hart and Mas-Colell as they apply to the Blackwell condition. However, it lacks
the conservation of flow property. Thus, we consider a second flow $f'$ which satisfies the conservation of flow. To do
this, we consider the levels of the graph. To review, root is a distinct vertex, and $\mathcal{L}(v) = d_1(\text{root}, v)$. If we consider
the flow $f$ as starting from the root, it (roughly) goes from level to level outward from the root until it reaches level
$L$. Then, while $f$ flows to level $L+1$ and reaches a dead end (violating the conservation property), $f'$ is switched,
and flows back to the root. In order to make the proof work, we have to bound the difference between $f$ and $f'$. Since
this difference is mostly on the flow from level $L$ to level $L+1$, we need to bound the fraction of the total flow that is
going out of the last level by showing that this flow is less than the flow going from the root to the first level, and it is
less than the flow from the first level to the second level, et cetera.

First we show that for nodes on most levels, the flow in equals the flow out. 

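Lemma 9. If Requirement 2 holds, then for all $j \in V$ such that $1 \le \mathcal{L}(j) \le L$,

$$\sum_{i:(i,j)\in E} f_{i,j} = \sum_{k:(j,k)\in E} f_{j,k}.$$

Corollary 10. By summing over the nodes in level $\ell$, for any level $1 \le \ell \le L$,

$$\sum_{(i,j)\in E:\mathcal{L}(j)=\ell} f_{i,j} = \sum_{(i,j)\in E:\mathcal{L}(i)=\ell} f_{i,j}.$$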

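Proof. From Requirement 2(c) we know there exists an $M > 0$ such that:

$$\sum_{i:(i,j)\in E} \left(R^{t,+}_{i,j}/M\right)\pi_i^{t+1} + \Big(1 - \sum_{k:(j,k)\in E} R^{t,+}_{j,k}/M\Big)\pi_j^{t+1} = \pi_j^{t+1} \tag{28}$$

$$\Big(\sum_{k:(j,k)\in E} R^{t,+}_{j,k}/M\Big)\pi_j^{t+1} = \sum_{i:(i,j)\in E} \left(R^{t,+}_{i,j}/M\right)\pi_i^{t+1} \tag{29}$$

$$\sum_{k:(j,k)\in E} R^{t,+}_{j,k}\,\pi_j^{t+1} = \sum_{i:(i,j)\in E} R^{t,+}_{i,j}\,\pi_i^{t+1} \tag{30}$$

The lemma follows by the definition of $f_{i,j}$. □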

If we want the conservation of flow to hold for all nodes, then we need to define a slightly different flow. We want
to say that the flow which is currently exiting the first $L$ levels (specifically between level $L$ and level $L+1$) is actually
flowing back into the root. So, we want to subtract the edges $E'' = \{(i,j) \in E : \mathcal{L}(j) \ge L+1 \vee \mathcal{L}(i) \ge L+1\}$, and add
the edges $E' = \{i \in V : \mathcal{L}(i) = L\} \times \{\text{root}\}$. For any edge $(i,j) \in E'$, define $f'_{i,j} = f_{i,j} + \sum_{k:\mathcal{L}(k)=L+1,(i,k)\in E} f_{i,k}$,
where $f_{i,j} = 0$ if $(i,j) \notin E$. For any edge $(i,j) \in E \setminus (E' \cup E'')$ where $\mathcal{L}(i), \mathcal{L}(j) \le L$, $f'_{i,j} = f_{i,j}$. Define

$\bar{E} = (E \cup E') \setminus E''$.

Thus, we now have a flow over a graph $(V, \bar{E})$, but we must prove conservation of flow.

Lemma 11. If Requirement 2 holds, for any $i \in V$, $\sum_{j:(i,j)\in \bar{E}} f'_{i,j} = \sum_{j:(j,i)\in \bar{E}} f'_{j,i}$.

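Proof. For $\mathcal{L}(i) \in \{1 \ldots L-1\}$, this is a direct result of Requirement 2(c). For when $\mathcal{L}(i) > L$, there is no flow out
or in, making the result trivial. For $\mathcal{L}(i) = 0$ (when $i = \text{root}$), this is a direct result of Requirement 2(d). For when
$\mathcal{L}(i) = L$, note that $\sum_{j:(j,i)\in E} f_{j,i} = \sum_{j:(i,j)\in E} f_{i,j}$, for all $j$ where $(j,i) \in \bar{E}$, $f'_{j,i} = f_{j,i}$, and for all $(i,j) \in \bar{E}$
where $\mathcal{L}(j) \in \{1 \ldots L\}$, $f'_{i,j} = f_{i,j}$, and that the flow $\sum_{j:(i,j)\in E,\,\mathcal{L}(j)\in\{0,L+1\}} f_{i,j} = f'_{i,\text{root}}$, so

$$\sum_{j:(j,i)\in \bar{E}} f'_{j,i} = \sum_{j:(j,i)\in E} f_{j,i} \tag{31}$$

$$= \sum_{j:(i,j)\in E} f_{i,j} \tag{32}$$

$$= \sum_{j:(i,j)\in E,\,\mathcal{L}(j)\in\{0,L+1\}} f_{i,j} + \sum_{j:(i,j)\in E,\,\mathcal{L}(j)\in\{1 \ldots L\}} f_{i,j} \tag{33}$$

$$= f'_{i,\text{root}} + \sum_{j:(i,j)\in \bar{E},\,\mathcal{L}(j)\in\{1 \ldots L\}} f'_{i,j} \tag{34}$$

$$= \sum_{j:(i,j)\in \bar{E}} f'_{i,j}. \tag{35}$$

□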



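Lemma 12. If Requirement 2 holds, then:

$$\sum_{(i,j)\in E} f_{i,j} \ge (L+1) \sum_{(i,j)\in E:\mathcal{L}(j)=L+1} f_{i,j} \tag{36}$$

$$\sum_{(i,j)\in E:\mathcal{L}(i)=0} f_{i,j} \ge \sum_{(i,j)\in E:\mathcal{L}(j)=L+1} f_{i,j} \tag{37}$$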

Proof. To obtain an intuition, consider the case where all outgoing edges from level j go to level j + 1 (modulo the 
last level). In this case, the flow from level 0 all goes to level 1, from there goes to level 2, and so forth until it reaches
level L and then returns to level 0. Thus, the inflows and outflows of all the levels would be equal. The problem with 
this is that outgoing edges from level j can go to other nodes in j, or nodes in level j — 1, et cetera. At an intuitive 
level, a backwards flow would not make more flow through the final level, any more than an eddy would somehow 
create water at the mouth of a river, and we must simply formally prove this. 

First, we define $g_{i,j} = \sum_{(k,l)\in \bar{E}:\mathcal{L}(k)=i,\,\mathcal{L}(l)=j} f'_{k,l}$, the total flow between levels. By Lemma 11, for all $i \in V$,
$\sum_j f'_{i,j} = \sum_j f'_{j,i}$, so the aggregate flow satisfies the conservation of flow, namely that for all $i$, $\sum_{j=0}^{L} g_{i,j} = \sum_{j=0}^{L} g_{j,i}$.
Also, if $j > i+1$, then $g_{i,j} = 0$. Define $n_i = g_{i,i+1}$, the flow between one level and the next. Since
$f$, $f'$, and $g$ are just different groupings of the total flow throughout the graph, $\sum_{(i,j)\in E} f_{i,j} = \sum_{(i,j)\in \bar{E}} f'_{i,j} = \sum_{i=0}^{L}\sum_{j=0}^{L} g_{i,j}$.
Since for all $i, j \in V$, $f_{i,j} \ge 0$, then for all $i, j$, $g_{i,j} \ge 0$. Hence $g_{L,0} + \sum_{i=0}^{L-1} n_i \le \sum_{(i,j)\in \bar{E}} f'_{i,j}$.

Moreover, $n_0 = g_{0,1} = \sum_{j:(\text{root},j)\in \bar{E}} f'_{\text{root},j} = \sum_{j:(\text{root},j)\in E} f_{\text{root},j}$, and $g_{L,0} \ge \sum_{(i,j)\in E:\mathcal{L}(j)=L+1} f_{i,j}$. So if we
prove that for all $i$, $g_{L,0} \le n_i$, then $g_{L,0} \le n_0$ and $g_{L,0}(L+1) \le g_{L,0} + \sum_{i=0}^{L-1} n_i$, and we have proven the lemma.

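First, we identify this backwards flow. Define $\delta_i$ to be the flow that originates at level $i$ or above and flows back to
a lower level. Formally, define $\delta_0 = 0$, and $\delta_i = \sum_{i'<i,\,j'\ge i} g_{j',i'} - g_{L,0}$. Note that $\delta_i \ge 0$.

Thus, for all $i$ where $0 < i < L$:

$$\delta_i - \delta_{i+1} = \sum_{i'<i,\,j'\ge i} g_{j',i'} - \sum_{i'<i+1,\,j'\ge i+1} g_{j',i'} \tag{38}$$

$$\delta_i - \delta_{i+1} = \sum_{i'<i,\,j'=i} g_{j',i'} + \sum_{i'<i,\,j'\ge i+1} g_{j',i'} - \Big(\sum_{i'<i,\,j'\ge i+1} g_{j',i'} + \sum_{i'=i,\,j'\ge i+1} g_{j',i'}\Big) \tag{39}$$

$$\delta_i - \delta_{i+1} = \sum_{i'<i,\,j'=i} g_{j',i'} - \sum_{i'=i,\,j'\ge i+1} g_{j',i'} \tag{40}$$

$$\delta_i - \delta_{i+1} = \sum_{i'<i} g_{i,i'} - \sum_{j'\ge i+1} g_{j',i} \tag{41}$$

$$\delta_i - \delta_{i+1} = \Big(g_{i,i} + \sum_{i'<i} g_{i,i'}\Big) - \Big(g_{i,i} + \sum_{j'\ge i+1} g_{j',i}\Big) \tag{42}$$

$$\delta_i - \delta_{i+1} = \sum_{i'\le i} g_{i,i'} - \sum_{j'\ge i} g_{j',i} \tag{43}$$

Since $g_{i,i+1} = n_i$, and $g_{i-1,i} = n_{i-1}$:

$$\delta_i - \delta_{i+1} = \Big(-n_i + \sum_{i'\le i+1} g_{i,i'}\Big) - \Big(-n_{i-1} + \sum_{j'\ge i-1} g_{j',i}\Big) \tag{44}$$

$$\delta_i - \delta_{i+1} = n_{i-1} - n_i + \sum_{i'\le i+1} g_{i,i'} - \sum_{j'\ge i-1} g_{j',i} \tag{45}$$

Since $g$ represents the level graph, $g_{i,i'} = 0$ if $i' > i+1$, or put another way, $g_{j',i} = 0$ if $j' < i-1$, so

$$\delta_i - \delta_{i+1} = n_{i-1} - n_i + \sum_{i'} g_{i,i'} - \sum_{j'} g_{j',i} \tag{46}$$

$$\delta_i - \delta_{i+1} = n_{i-1} - n_i \tag{47}$$

So, for all $0 \le i < L-1$:

$$\delta_{i+1} - \delta_{i+2} = n_i - n_{i+1} \tag{48}$$

$$n_i = \delta_{i+1} - \delta_{i+2} + n_{i+1} \tag{49}$$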



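For $n_0$, note that $\sum_i g_{i,0} = g_{0,0} + \delta_1 + g_{L,0}$, and $\sum_i g_{0,i} = g_{0,0} + n_0$, so $g_{0,0} + \delta_1 + g_{L,0} = g_{0,0} + n_0$, and
$g_{L,0} = n_0 - \delta_1$. This is the base case in a recursive proof that for all $i < L$, $g_{L,0} = n_i - \delta_{i+1}$. If we wish to prove it
holds for $i+1$, then we assume it holds for $i$, or $g_{L,0} = n_i - \delta_{i+1}$. By Equation (49), for $i < L-1$:

$$g_{L,0} = \delta_{i+1} - \delta_{i+2} + n_{i+1} - \delta_{i+1} \tag{50}$$

$$g_{L,0} = n_{i+1} - \delta_{i+2} \tag{51}$$

Since $\delta_i \ge 0$, this implies that for $i < L$, $g_{L,0} \le n_i$, which completes the proof. □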
Lemma 13. If Requirements 1 and 2 hold, and $b = \Delta/(L+1)$, then $\sum_{(i,j)\in E} R^{t,+}_{i,j}\,\pi_i^{t+1}\,(u^{t+1}(j) - u^{t+1}(i) - b) \le 0$.
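Proof. First, consider the case where $\pi^{t+1}$ is degenerate. Then, whenever $\pi_i^{t+1} > 0$, we know $R^{t,+}_{i,j} = 0$ for all
$(i,j) \in E$, and so our sum of interest is exactly 0. Note that, since $f_{i,j} = R^{t,+}_{i,j}\pi_i^{t+1}$, what we need to prove is:

$$\sum_{(i,j)\in E} f_{i,j}\,(u^{t+1}(j) - u^{t+1}(i) - b) \le 0 \tag{52}$$

$$\Big(\sum_{(i,j)\in E} f_{i,j}\,(u^{t+1}(j) - u^{t+1}(i))\Big) - b \sum_{(i,j)\in E} f_{i,j} \le 0. \tag{53}$$

Suppose $\pi^{t+1}$ is not degenerate. We examine Equation (53)'s two summations. Notice that only edges where
$\pi_i^{t+1} > 0$ have $f_{i,j} \ne 0$, and by Requirement 2(e) this is only true if $\mathcal{L}(i) \le L$. Also, $f_{i,j} > 0$ only if
$0 \le \mathcal{L}(i) \le L$ and $1 \le \mathcal{L}(j) \le L+1$ (because level zero has no incoming edges), so:

$$\sum_{(i,j)\in E} f_{i,j}\,(u^{t+1}(j) - u^{t+1}(i)) = \sum_{(i,j)\in E} f_{i,j}u^{t+1}(j) - \sum_{(i,j)\in E} f_{i,j}u^{t+1}(i) \tag{54}$$

$$= \sum_{\ell=1}^{L+1} \sum_{(i,j)\in E:\mathcal{L}(j)=\ell} f_{i,j}u^{t+1}(j) - \sum_{\ell=0}^{L} \sum_{(i,j)\in E:\mathcal{L}(i)=\ell} f_{i,j}u^{t+1}(i). \tag{55}$$

Renaming the dummy variables in the second term and then combining:

$$\sum_{(i,j)\in E} f_{i,j}\,(u^{t+1}(j) - u^{t+1}(i)) = \sum_{\ell=1}^{L+1} \sum_{(i,j)\in E:\mathcal{L}(j)=\ell} f_{i,j}u^{t+1}(j) - \sum_{\ell=0}^{L} \sum_{(j,k)\in E:\mathcal{L}(j)=\ell} f_{j,k}u^{t+1}(j) \tag{56}$$

$$= \sum_{\ell=1}^{L} \Big( \sum_{(i,j)\in E:\mathcal{L}(j)=\ell} f_{i,j}u^{t+1}(j) - \sum_{(j,k)\in E:\mathcal{L}(j)=\ell} f_{j,k}u^{t+1}(j) \Big) + \sum_{(i,j)\in E:\mathcal{L}(j)=L+1} f_{i,j}u^{t+1}(j) - \sum_{(j,k)\in E:\mathcal{L}(j)=0} f_{j,k}u^{t+1}(j). \tag{57}$$

First, we show that any term between 1 and $L$ is zero. For any $1 \le \ell \le L$, by summing over nodes in level $\ell$:

$$\sum_{(i,j)\in E:\mathcal{L}(j)=\ell} f_{i,j}u^{t+1}(j) - \sum_{(j,k)\in E:\mathcal{L}(j)=\ell} f_{j,k}u^{t+1}(j) = \sum_{j:\mathcal{L}(j)=\ell} \Big( \sum_{i:(i,j)\in E} f_{i,j}u^{t+1}(j) - \sum_{k:(j,k)\in E} f_{j,k}u^{t+1}(j) \Big) \tag{58}$$

$$= \sum_{j:\mathcal{L}(j)=\ell} u^{t+1}(j) \Big( \sum_{i:(i,j)\in E} f_{i,j} - \sum_{k:(j,k)\in E} f_{j,k} \Big). \tag{59}$$

By Lemma 9, $\sum_{i:(i,j)\in E} f_{i,j} = \sum_{k:(j,k)\in E} f_{j,k}$, so these terms are zero, leaving:

$$\sum_{(i,j)\in E} f_{i,j}\,(u^{t+1}(j) - u^{t+1}(i)) = \sum_{(i,j)\in E:\mathcal{L}(j)=L+1} f_{i,j}u^{t+1}(j) - \sum_{(j,k)\in E:\mathcal{L}(j)=0} f_{j,k}u^{t+1}(j). \tag{60}$$

If $\mathcal{L}(j) = 0$, then $j = \text{root}$:

$$\sum_{(i,j)\in E} f_{i,j}\,(u^{t+1}(j) - u^{t+1}(i)) = \sum_{(i,j)\in E:\mathcal{L}(j)=L+1} f_{i,j}u^{t+1}(j) - \sum_{(j,k)\in E:\mathcal{L}(j)=0} f_{j,k}u^{t+1}(\text{root}). \tag{61}$$

Moreover, for any $j$, $u^{t+1}(j) - u^{t+1}(\text{root}) \le \Delta$, so:

$$\sum_{(i,j)\in E} f_{i,j}\,(u^{t+1}(j) - u^{t+1}(i)) \le \sum_{(i,j)\in E:\mathcal{L}(j)=L+1} f_{i,j}\,(u^{t+1}(\text{root}) + \Delta) - \sum_{(j,k)\in E:\mathcal{L}(j)=0} f_{j,k}u^{t+1}(\text{root}) \tag{62}$$

$$\le \Delta \sum_{(i,j)\in E:\mathcal{L}(j)=L+1} f_{i,j} + u^{t+1}(\text{root}) \Big( \sum_{(i,j)\in E:\mathcal{L}(j)=L+1} f_{i,j} - \sum_{(j,k)\in E:\mathcal{L}(j)=0} f_{j,k} \Big). \tag{63}$$

By Lemma 12, Equation (37), the flow into level $L+1$ is less than or equal to the flow out of level 0, so the last part
is nonpositive and:

$$\sum_{(i,j)\in E} f_{i,j}\,(u^{t+1}(j) - u^{t+1}(i)) \le \Delta \sum_{(i,j)\in E:\mathcal{L}(j)=L+1} f_{i,j}. \tag{64}$$

From Lemma 12, Equation (36), we can show that the second term of Equation (53) satisfies:

$$b \sum_{(i,j)\in E} f_{i,j} \ge b(L+1) \sum_{(i,j):\mathcal{L}(j)=L+1} f_{i,j}. \tag{65}$$

Putting Equations (65) and (64) together with the fact that $b = \Delta/(L+1)$, we get,

$$\sum_{(i,j)\in E} f_{i,j}\,(u^{t+1}(j) - u^{t+1}(i) - b) \le \Delta \sum_{(i,j):\mathcal{L}(j)=L+1} f_{i,j} - b(L+1) \sum_{(i,j):\mathcal{L}(j)=L+1} f_{i,j} \tag{66}$$

$$\le (\Delta - b(L+1)) \sum_{(i,j):\mathcal{L}(j)=L+1} f_{i,j} = 0, \tag{67}$$

which is what we were trying to prove. □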

Lemma 13 is very close to the Blackwell condition, but not identical, so we sketch a quick variation on a special 
case of Blackwell's theorem so we can apply it to our problem. 

Fact 14. $(a+b)^+ \le a^+ + b^+$.

Lemma 15. $[(a+b)^+]^2 \le (a^+ + b)^2$.

Proof. 1. If $a, b \ge 0$: $[(a+b)^+]^2 = (a+b)^2 \le (a^+ + b)^2$.

2. If $a, b \le 0$: $[(a+b)^+]^2 = 0 \le (a^+ + b)^2$.

3. If $a \ge 0$, $b \le 0$: if $-b \ge a$, then $[(a+b)^+]^2 = 0 \le (a^+ + b)^2$, otherwise $[(a+b)^+]^2 = (a+b)^2 = (a^+ + b)^2$.

4. If $a \le 0$, $b \ge 0$: if $-a \ge b$, then $[(a+b)^+]^2 = 0 \le (a^+ + b)^2$, otherwise $[(a+b)^+]^2 = (a+b)^2 \le b^2 = (a^+ + b)^2$.

□

Fact 16. If $a_i \ge 0$ for $i = 1 \ldots n$, then $\sum_{i=1}^{n} a_i \le \sqrt{n \sum_{i=1}^{n} a_i^2}$.

Fact 17. $E[X]^2 \le E[X^2]$.
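As a sanity check on the positive-part inequalities, the following sketch tests Fact 14, Lemma 15, and the empirical form of Fact 17 on random samples. It is a numerical spot check under assumed sampling ranges and tolerances, not a substitute for the case analysis above; the helper names are illustrative.

```python
import random

def pos(x):
    """Positive part x^+ = max(x, 0)."""
    return max(x, 0.0)

def check_fact_14(a, b, tol=1e-12):
    """(a + b)^+ <= a^+ + b^+."""
    return pos(a + b) <= pos(a) + pos(b) + tol

def check_lemma_15(a, b, tol=1e-9):
    """[(a + b)^+]^2 <= (a^+ + b)^2."""
    return pos(a + b) ** 2 <= (pos(a) + b) ** 2 + tol

def check_fact_17(xs, tol=1e-9):
    """E[X]^2 <= E[X^2] for the empirical distribution of xs."""
    mean = sum(xs) / len(xs)
    mean_sq = sum(x * x for x in xs) / len(xs)
    return mean ** 2 <= mean_sq + tol

rng = random.Random(0)
samples = [(rng.uniform(-5, 5), rng.uniform(-5, 5)) for _ in range(10000)]
fact_14_holds = all(check_fact_14(a, b) for a, b in samples)
lemma_15_holds = all(check_lemma_15(a, b) for a, b in samples)
fact_17_holds = check_fact_17([a for a, _ in samples])
```

Note that Lemma 15 is asymmetric: only $a$ is truncated on the right-hand side, which is what lets the previous regret be truncated while the new increment keeps its sign.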

We restate Theorem 3 from Section 4: 

Theorem 3. For any directed graph with maximum out-degree D and any designated vertex root, (A/ (L + 1), L)- 
regret matching, after T steps, will have expected local swap regret no worse than, 



2^ L localswapj — Zy -)- 1 a/T" 



^[-^Lalswap] < r | i H /= (13) 



where $E_L = \{(i,j) \in E \mid \mathcal{L}(i) \le L\}$.
Proof.



E[R 



localswapJ 



E 



E 



E 



= E 



< E 



= E 



< E 



^( max f lla^i^W-^))] 

y( max t. + f 1(^=^11 

i G y \ \t=i / 



max i?, 



V Vl(«' = 1 )i + max 



6T + V max jjJV*" 

w+E E 



= «r+ E 



(i,j)eE L 



(68) 
(69) 
(70) 
(71) 
(72) 
(73) 

(74) 
(75) 



By Facts 16 and 17, 



<bT+[\E L \ E 



K i,3 



(76) 



<KT+ |^| J] B [(^ + ) S 
We can bound the inner term as follows, using Lemma 15: 

£ ^te) + ) 2 l< E s[(^ J - 1 - + + l(a r =i)(« T C?)-ti T (0-6)) S 



(»,j)eBi 



E * K liH 

(i,j)EE L 1 



y: s[(i(o T =o(ti T (j)-« T (i)-fc)) a ] 

(i,j)eE L 

+ E s[2flr j - 1 ' + l(a r = i)(« r 0-)-« T (<)-6)" 

(i,j)eE L 



E * (^' T 

(i,j)eE L L 



£ (l(a T = i)(« T (i)-^(i)-6)) 2 

(i,j)eE L 



+ 2 E 

a 1 -- 

Pr[a 



1.....T-1 ,.1,...,T-1 



(i,i)eE L 



(77) 

(78) 
(79) 



(80) 



a 1 7 ; .// ; ' 



,T-1 



1,...,T-1 1,...,T-1 



]) 



(81) 
By Lemma 13, $\sum_{(i,j)\in E_L} R^{T-1,+}_{i,j}\,\pi_i^{T}\,(u^{T}(j) - u^{T}(i) - b) \le 0$ regardless of the previous history.

E E\i^fr\< E ^[(^ + ) 

(i,j)eE L 1 

* E ^[(Cr 1 ") 



(i,j)eE L 



E 



x: (i(« T =*)(A-fe)) 2 



(i,j)eE L 



< TD(A — b) < TD I A 



+ £>(A-6)< 



Putting these two pieces together, we get, 



(82) 
(83) 
B Proof for Color Regret

Requirement 3. Let $C$ be a countable (but possibly infinite) set of colors. The edge coloring $c : E \to C$ is such that

$c(i,j) = c(i,k) \Leftrightarrow j = k$.

We restate Theorem 8 from Section 5.2: 

Theorem 8. For an arbitrary graph G with maximum degree D, arbitrarily chosen vertex root, and edge coloring 
C, (A/(L + 1), L, C)-colored-regret matching applied after T steps will have expected local colored regret no worse 
than, 
where $\mathcal{C}_L = \{C \in \mathcal{C} \mid \exists (i,j) \in C \text{ s.t. } \mathcal{L}(i) \le L\}$.

Proof First, we show that £ ceC ^ + E(i,i)eE V +1 (j) - u t+1 (i) - b)) < 0. 

c(i,j)=c 
By Lemma 13:



]r^+ ^ Tr 1 (« m (j)-« m (o-fc))= E wn)-« Hi (i)-i))<o (9i) 



cGC (i,j)€E 
c(i,j)=c 






Now we can bound our quantity of interest. 
We can bound the inner term as follows,
Because only one action is taken, and for each color only one edge originating at an action can have that color,
Putting these two pieces together, we get,
□



C Decision Tree Graphs 

A decision tree is a representation of a hypothesis. Given an instance space where there are a finite number of binary 
features, a decision tree can represent an arbitrary hypothesis. We describe decision trees recursively: the simplest 
trees are leaves, which represent constant functions. More complex trees have two subtrees, and a root node labeled 
with a variable. A subtree cannot have a variable that is referred to in the root. 
We define $T_k(S)$ recursively, where $T_k(S)$ will be the set of trees of depth $k$ or less over the variable set $S$. Define
the set $T_0(S) = \{\text{true}, \text{false}\}$. Define $T_k(S)$ such that:

$$T_k(S) = T_{k-1}(S) \cup \bigcup_{s \in S} \left(\{s\} \times T_{k-1}(S \setminus \{s\}) \times T_{k-1}(S \setminus \{s\})\right) \tag{114}$$

Define $T^*(S) = T_{|S|}(S)$ to be the set of all decision trees over the variables $S$. Three example decision trees in
$T^*(\{x_1, x_2\})$ are $(x_1, \text{true}, \text{false})$, true, and $(x_1, (x_2, \text{true}, \text{false}), \text{false})$. Suppose we have an example $x$, mapping
variables to $\{\text{true}, \text{false}\}$. For any tree $t$, we can recursively define $t(x)$:

1. If $t \in T_0$, then $t(x) = t$.

2. If $t \in T_k$ and $x(t_1) = \text{true}$, then $t(x) = t_2(x)$.

3. If $t \in T_k$ and $x(t_1) = \text{false}$, then $t(x) = t_3(x)$.
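The recursive definitions above translate directly into code. The sketch below is an illustration under assumed conventions: a leaf is a Python `bool`, a decision node is a tuple `(variable, true_subtree, false_subtree)`, and an example is a dict from variable names to booleans. The function names are hypothetical.

```python
from itertools import product

def trees(k, variables):
    """Enumerate T_k(S): all trees of depth at most k over the variable set S."""
    variables = frozenset(variables)
    result = {True, False}  # T_0(S): the two constant leaves
    if k == 0:
        return result
    for s in variables:
        # Subtrees may not reuse the variable tested at the root.
        subtrees = trees(k - 1, variables - {s})
        for true_branch, false_branch in product(subtrees, repeat=2):
            result.add((s, true_branch, false_branch))
    return result

def evaluate(tree, x):
    """Compute t(x): return a leaf as is, else follow the branch chosen by x at the root variable."""
    if isinstance(tree, bool):
        return tree
    variable, true_branch, false_branch = tree
    return evaluate(true_branch if x[variable] else false_branch, x)

all_trees = trees(2, {"x1", "x2"})  # T*({x1, x2})
example = ("x1", ("x2", True, False), False)
label = evaluate(example, {"x1": True, "x2": False})
```

Because every tree in $T_{k-1}(S)$ is either a leaf or already of the form produced by the union, starting from the two leaves and adding the product terms yields the same set as the recursive definition.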

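Define $P = \{p \in (S \times \{\text{true}, \text{false}\})^{|S|} : \forall i \ne j,\ (p_i)_1 \ne (p_j)_1\}$ to be the paths in the trees without repeating
variables. We can talk about whether a path is in a tree. Define $V_p(t)$ to be a function from $T^*$ to $S \cup \{\text{true}, \text{false}, \emptyset\}$,
where $V_p(t) = \emptyset$ if the path $p$ is not present in the tree, and otherwise $V_p(t)$ is the value of the node at the end of the
path. Formally,

$$V_{\emptyset}(t) = \begin{cases} t & \text{if } t \in T_0 \\ t_1 & \text{otherwise} \end{cases} \tag{115}$$

$$V_{(v,l)\circ p}(t) = \begin{cases} \emptyset & \text{if } t \in T_0 \text{ or } t_1 \ne v \\ V_p(t_2) & \text{if } t \notin T_0 \text{ and } t_1 = v \text{ and } l = \text{true} \\ V_p(t_3) & \text{if } t \notin T_0 \text{ and } t_1 = v \text{ and } l = \text{false} \end{cases} \tag{116}$$

Given a path $p \in P$ and a tree $t' \in T^*$, define $R_{p,t'}(t)$ to replace the tree at $p$ with $t'$ if $V_p(t) \ne \emptyset$. Formally:

$$R_{\emptyset,t'}(t) = t' \tag{117}$$

$$R_{(v,l)\circ p,t'}(t) = \begin{cases} t & \text{if } t \in T_0 \text{ or } t_1 \ne v \\ (t_1, R_{p,t'}(t_2), t_3) & \text{if } t \notin T_0 \text{ and } t_1 = v \text{ and } l = \text{true} \\ (t_1, t_2, R_{p,t'}(t_3)) & \text{if } t \notin T_0 \text{ and } t_1 = v \text{ and } l = \text{false} \end{cases} \tag{118}$$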

Consider the following operations on decision trees: 

1. ReplaceWithNode($p, v, l_1, l_2$) $= R_{p,(v,l_1,l_2)}$ (where it applies): If there exists a node or leaf at path $p$, replace
it with a decision stump with variable $v$, with label $l_1$ on the true branch, and label $l_2$ on the false branch,
but only if $V_p(t) \ne v$.

2. ReplaceWithLeaf($p, l_1$) $= R_{p,l_1}$: If there exists a node or leaf at path $p$, replace it with a leaf $l_1$.
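The path lookup and the two edge-generating operations can be sketched as below. The conventions are assumptions for illustration: leaves are `bool`, nodes are tuples `(variable, true_subtree, false_subtree)`, a path is a tuple of `(variable, branch)` pairs, and `None` stands in for the empty-set value. The base case of `value_at` (return the leaf label or the root variable on the empty path) is inferred from the prose description. The sketch does not enforce the no-repeated-variable constraint, and it does not filter out a ReplaceWithLeaf that leaves the tree unchanged.

```python
EMPTY = None  # stands in for the "path not present" value

def value_at(path, tree):
    """V_p(t): variable or leaf label at the end of path p, or EMPTY if p is absent."""
    if not path:
        return tree if isinstance(tree, bool) else tree[0]
    (v, branch), rest = path[0], path[1:]
    if isinstance(tree, bool) or tree[0] != v:
        return EMPTY
    return value_at(rest, tree[1] if branch else tree[2])

def replace_at(path, new_subtree, tree):
    """R_{p,t'}(t): replace the subtree at p with t'; return t unchanged if p is absent."""
    if not path:
        return new_subtree
    (v, branch), rest = path[0], path[1:]
    if isinstance(tree, bool) or tree[0] != v:
        return tree
    if branch:
        return (tree[0], replace_at(rest, new_subtree, tree[1]), tree[2])
    return (tree[0], tree[1], replace_at(rest, new_subtree, tree[2]))

def replace_with_node(tree, path, v, l1, l2):
    """ReplaceWithNode: put the stump (v, l1, l2) at p, unless p is absent or already tests v."""
    current = value_at(path, tree)
    if current is EMPTY or current == v:
        return None  # the operation does not apply
    return replace_at(path, (v, l1, l2), tree)

def replace_with_leaf(tree, path, l1):
    """ReplaceWithLeaf: put the leaf l1 at p, if p is present."""
    if value_at(path, tree) is EMPTY:
        return None
    return replace_at(path, l1, tree)

stump = ("X", True, False)
collapsed = replace_with_leaf(stump, (), False)               # the leaf false
rebuilt = replace_with_node(collapsed, (), "X", False, True)  # ("X", False, True)
```

The last two lines trace the two-step route through a leaf that the edge lengths are designed to keep off shortest paths.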

These operations create the edges between trees: we will determine how to color them later. Because ReplaceWithNode
is a more complex operation, an edge created by ReplaceWithNode will have length 1.1, whereas ReplaceWithLeaf
will have length 1.0. This weighting is important: otherwise, consider the following sequence of trees:

(X, true, false) 

(false) 
(X, false, true) 

If splitting was the same length as changing leaves, this bizarre path would be a shortest path between (X, true, false) 
and (X, false, true). In general, when designing this distance function over trees, a critical concern was whether unnec- 
essary reconstruction would be on a shortest path. For example, a shortest path from (X, (Y, true, false), (Z, false, true)) 
to (X, (Y, false, true), (Z, true, false)) could pass through false, (X, true, false), (X, (Y, false, true), false). But, since 
replacing something with a decision tree costs slightly more than changing a leaf, we avoid this. 

More generally, if the decision about whether or not an edge is on the shortest path can be made locally, then this 
reduces the number of colors required. Thus, massively reconstructing the root because the leaves are wrong is not 
only counterintuitive, it makes the algorithm slower and more complex. 
We first hypothesize a shortest path distance function between trees based on these operations, and then we will
prove it satisfies the above operations. Note that this function is not symmetric, because the shortest path distance 
function on a directed graph is not always symmetric. 

Given two decision trees A and B, a decision node a in A and a decision node b are in structural agreement if 
they are on the same path p, and they are labeled with the same variable. A decision node in B that does not agree 
with a decision node in A is in structural disagreement with A. Given a leaf in B that has a parent that is in structural 
agreement with A, if the leaf is not present in A, it is in leaf disagreement with A. 

Define $d^*_s(A,B)$ to be the structural disagreement distance between $A$ and $B$, the number of nodes in $B$ that are in
structural disagreement with $A$. Define $d^*_l(A,B)$ to be the leaf disagreement distance between $A$ and $B$, the number
of leaves in $B$ in disagreement with $A$. Define $d^*(A,B) = 1.1\,d^*_s(A,B) + d^*_l(A,B)$.
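A minimal sketch of this distance, assuming leaves are `bool` and decision nodes are tuples `(variable, true_subtree, false_subtree)`. One interpretive assumption is made explicit in the code: a leaf at the root of $B$ has no parent, so it is treated as if its parent agreed. The function names are illustrative.

```python
def is_leaf(t):
    return isinstance(t, bool)

def disagreements(a, b, parent_agrees=True):
    """Count (structural, leaf) disagreements of tree b against tree a.

    `a` is the subtree of A found at the same path as `b`, or None when A has
    nothing at that path. A root leaf is counted as if its parent agreed
    (an assumption; the text only covers leaves that have a parent).
    """
    if is_leaf(b):
        leaf_differs = not (is_leaf(a) and a == b)
        return 0, int(parent_agrees and leaf_differs)
    agrees = a is not None and not is_leaf(a) and a[0] == b[0]
    structural, leaf = (0 if agrees else 1), 0
    for index in (1, 2):
        # Below a disagreeing node, the path does not exist in A.
        child_a = a[index] if agrees else None
        s, l = disagreements(child_a, b[index], parent_agrees=agrees)
        structural, leaf = structural + s, leaf + l
    return structural, leaf

def d_star(a, b):
    """d*(A, B) = 1.1 * structural disagreements + leaf disagreements."""
    structural, leaf = disagreements(a, b)
    return 1.1 * structural + leaf

direct = d_star(("X", True, False), ("X", False, True))  # two leaf changes
detour = d_star(("X", True, False), False) + d_star(False, ("X", False, True))
```

With the 1.1 weight, relabeling two leaves is strictly cheaper than collapsing the stump to a leaf and rebuilding it, and the distance is visibly asymmetric: growing a stump from a leaf costs more than collapsing it.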

Intuitively, this distance represents the fact that an example shortest path from A to B can be generated by first 
fixing all label disagreements between A and B, and then applying ReplaceWithNode to create every node in B that
is in structural disagreement with A (correctly labeling leaves where appropriate). 

Fact 18. If $d : V \times V \to Z^+$ is the shortest distance function on a completely connected directed graph $(V, E)$, then
for any $i, j \in V$ where $(i,j) \notin E$, there exists a $k$ such that $(i,k) \in E$ and $d(i,j) = d(i,k) + d(k,j)$.

Theorem 19. $d^* : V \times V \to Z^+$ corresponds to the shortest distance function on a completely connected directed
graph $(V, E)$ if there exists a $\Delta > 0$ and a $\delta = \Delta/2$ such that the following properties hold:

1. For all $a, b \in V$, $d^*(a,b) = 0$ iff $a = b$.

2. For all $a, b \in V$, $d^*(a,b) > \delta$ iff $a \ne b$.

3. For all $a, b \in V$, if $a \ne b$ there exists a $c \in V$ such that $d^*(a,c) \le \Delta$ and $d^*(a,b) \ge d^*(a,c) + d^*(c,b)$.

4. For all $a, b, c \in V$, if $d^*(a,c) \le \Delta$, then $d^*(a,b) \le d^*(a,c) + d^*(c,b)$.

Proof. Observe that the graph $(V,E)$ with edges $E = \{(i,j) \in V^2 : d^*(i,j) \le \Delta\}$, where the weight of an edge
$(i,j) \in E$ is $d^*(i,j)$, is a good candidate for the graph under consideration. We prove this in two steps. We first prove
by induction that $d(i,j) \le d^*(i,j)$. Then, leveraging this, we prove by induction that $d(i,j) = d^*(i,j)$.

First, we prove that if $d^*(i,j) \le \Delta$, then $d(i,j) = d^*(i,j)$. First, observe that if $d^*(i,j) = 0$, then $i = j$, so
$d(i,j) = 0$. Secondly, if $d^*(i,j) \in (0, \Delta]$, then there exists an edge $(i,j) \in E$, so $d(i,j) \le d^*(i,j)$. Since each edge
is larger than $\Delta/2$, for any path of length 2 or greater, the length is larger than $\Delta$, so only a direct path can be less than
or equal to $\Delta$. This establishes that there is no path between $i$ and $j$ shorter than the direct edge.

For any nonnegative integer $k$, define $P(k)$ to be the property that for any $i, j \in V$, if the distance $d^*(i,j) \le k\delta$,
the shortest distance between two vertices in this graph $d(i,j)$ is less than or equal to $d^*(i,j)$. This holds for $P(0)$,
$P(1)$, and $P(2)$ because of the paragraph above. Now, suppose that $P(k)$ holds for $k \ge 2$; we need to establish that it
holds for $P(k+1)$. Consider some pair $i, j \in V$ where $d^*(i,j) \in (k\delta, (k+1)\delta]$; then $i \ne j$, and by condition 3,
there exists an $m$ where $d^*(i,m) \le \Delta$ and $d^*(i,j) \ge d^*(i,m) + d^*(m,j)$. Since $d^*(i,j) \le (k+1)\delta$ and $d^*(i,m) > \delta$,
$d^*(m,j) < k\delta$, so $d^*(m,j) \ge d(m,j)$. From the paragraph above, $d(i,m) = d^*(i,m)$, so $d^*(i,j) \ge d(i,m) + d(m,j)$,
and by the triangle inequality on $d$, $d^*(i,j) \ge d(i,j)$.

Thus, since for all $i, j \in V$ there exists a $k$ where $d^*(i,j) \le k\delta$, for all $i, j \in V$, $d(i,j) \le d^*(i,j)$.

Next, we prove that if $d(i,j) \le \Delta$, then $d(i,j) = d^*(i,j)$. First, observe that if $d(i,j) = 0$, then $i = j$, so
$d^*(i,j) = 0$. Secondly, if $(i,j) \notin E$, then the distance between $i$ and $j$ must be greater than $\Delta$, because each edge
is larger than $\Delta/2$. Therefore, if $d(i,j) \in (0, \Delta]$ there is a direct edge between $i$ and $j$ with distance $d^*(i,j)$, so
$d^*(i,j) \le \Delta$, and so by the second paragraph $d(i,j) = d^*(i,j)$.

Define $Q(k)$ to be the property that for any $i, j \in V$, if $d(i,j) \le k\delta$ then $d(i,j) = d^*(i,j)$. $Q(0)$, $Q(1)$ and $Q(2)$
hold from the above paragraph. Now, suppose that $Q(k)$ holds for some $k \ge 2$; we need to establish the property
for $Q(k+1)$. Consider some pair $i, j \in V$ where $d(i,j) \in (k\delta, (k+1)\delta]$; then $i \ne j$, and by Fact 18,
there exists an $m$ where there exists an edge from $i$ to $m$ and $d(i,j) = d(i,m) + d(m,j)$. Since there exists an edge
$(i,m)$, then $d(i,m) \le \Delta$ and $d(i,m) = d^*(i,m) > \delta$. Thus, $d(m,j) < \delta(k+1) - \delta = \delta k$, so $d(m,j) = d^*(m,j)$.
Moreover, by condition 4, $d^*(i,j) \le d^*(i,m) + d^*(m,j) = d(i,j)$. Thus, since we know that $d^*(i,j) \ge d(i,j)$, then
$d^*(i,j) = d(i,j)$.

Therefore, since $d^*(i,j) = d(i,j)$, and $d$ is the shortest distance for graph $(V, E)$, then $d^*(i,j)$ is a shortest
distance function for a weighted graph. □
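The construction in the proof can be checked numerically on a toy example: keep only the edges with $d^* \le \Delta$, compute all-pairs shortest paths, and compare against $d^*$. The sketch below uses Floyd-Warshall on points of a line. The candidate functions, the vertex set, and the helper name are assumptions chosen for illustration and are unrelated to the decision tree graph.

```python
import itertools

def shortest_paths_from_candidate(vertices, d_star, delta_max):
    """Keep edges with d*(i, j) <= delta_max (weighted by d*) and run Floyd-Warshall."""
    inf = float("inf")
    dist = {
        (i, j): 0.0 if i == j else (d_star(i, j) if d_star(i, j) <= delta_max else inf)
        for i in vertices
        for j in vertices
    }
    # itertools.product varies its first component slowest, so m is the outer loop.
    for m, i, j in itertools.product(vertices, repeat=3):
        if dist[i, m] + dist[m, j] < dist[i, j]:
            dist[i, j] = dist[i, m] + dist[m, j]
    return dist

line = list(range(6))

# |i - j| with Delta = 1, delta = 0.5 satisfies all four properties.
absolute = shortest_paths_from_candidate(line, lambda i, j: abs(i - j), 1.0)
absolute_recovered = all(absolute[i, j] == abs(i - j) for i in line for j in line)

# (i - j)^2 violates property 4 (the local triangle inequality), so it is not recovered.
squared = shortest_paths_from_candidate(line, lambda i, j: (i - j) ** 2, 1.0)
squared_recovered = all(squared[i, j] == (i - j) ** 2 for i in line for j in line)
```

The second candidate shows why property 4 is needed: with only unit steps available as edges, the induced shortest path between points three apart has length 3, not 9.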

Lemma 20. For the decision tree metric $d^*$ above, for any two trees $A, B$ where $A \ne B$, there exists a tree $C$ such
that $d^*(A,C) \le 1.1$ and $d^*(A,B) \ge d^*(A,C) + d^*(C,B)$.

Proof. If B has a leaf at the root, then set C = B.

Suppose that, given A and B, there is label disagreement. Find a node with label disagreement, and correct all
the labels in A to form C. This reduces the number of nodes with label disagreement by one, and the decision node
disagreement stays the same.

Suppose that, given A and B, there is no label disagreement, but there is structural disagreement. Then select a node
d which has decision node disagreement. Define C to be a tree where we replace node d with the corresponding node
in tree B, with leaves that agree with the children of d if d has children, and arbitrary otherwise. This reduces the
structural disagreement by one. It does not increase the label disagreement, because if d has children with labels in B,
it has those same children in C.

Finally, if A and B have no label disagreement or structural disagreement, then they are the same tree and have
distance 0. □

Before proving a lower bound, we focus on a particular case. Namely, that changing a correct decision node of a 
tree to have the wrong variable cannot decrease the distance. 

Lemma 21. Given two trees $A$ and $B$ and a subtree $S$ in $B$, if $n_S$ is the number of nodes in agreement with B in the
subtree $S$, and $l_S$ is the number of leaves in disagreement with $A$ in $S$, then $l_S \le n_S + 1$.

Proof. We prove this by recursion on the size of the subtree $S$ in $B$. If $S$ is of size 1, then $S$ is a leaf in $B$, so $n_S = 0$
and $l_S \le 1$, and the result holds. Suppose we have proven this for all subtrees $S'$ of size less than $S$. If $S$ is rooted at
a node in disagreement, then $n_S = 0$ and $l_S = 0$, and the result holds (we don't need induction for this case). If $S$ is
rooted at a node $x$ in agreement, then define $S_{\text{true}}$ to be the subtree of the node down the edge labeled true leaving
$x$, and define $S_{\text{false}}$ to be the subtree down the edge labeled false leaving $x$. $|S_{\text{true}}| < |S|$ and $|S_{\text{false}}| < |S|$, so by
induction $l_{S_{\text{true}}} \le n_{S_{\text{true}}} + 1$ and $l_{S_{\text{false}}} \le n_{S_{\text{false}}} + 1$. Since $x$ is a node in agreement, $l_S = l_{S_{\text{true}}} + l_{S_{\text{false}}}$, and
therefore:

$$l_S \le n_{S_{\text{true}}} + n_{S_{\text{false}}} + 1 + 1 \tag{119}$$

$$l_S \le (n_{S_{\text{true}}} + n_{S_{\text{false}}} + 1) + 1 \tag{120}$$

Again, since $x$ is a node in agreement, $n_{S_{\text{true}}} + n_{S_{\text{false}}} + 1 = n_S$, so:

$$l_S \le n_S + 1. \tag{121}$$

□

We will use this fact in several places in the resulting proofs. 

Lemma 22. Given two trees $A$ and $B$ which agree on node $y$, if you change $y$ in $A$ to a node $x$ or leaf to create $C$,
then $d^*(A,B) < d^*(C,B) + 1$.

Proof. If $S$ is the subtree rooted at $y$ in $B$, then $d^*_s(A,B) + n_S = d^*_s(C,B)$ and $d^*_l(A,B) - l_S = d^*_l(C,B)$. By
definition, $d^*(A,B) = d^*(C,B) - 1.1\,n_S + l_S$. Since $y$ is in agreement, $n_S \ge 1$. By Lemma 21, we know that

$l_S \le n_S + 1$, so

$$d^*(A,B) \le d^*(C,B) - 1.1\,n_S + (n_S + 1) \tag{122}$$

$$d^*(A,B) \le d^*(C,B) - 0.1\,n_S + 1 \tag{123}$$

Since $n_S \ge 1$, $0.1\,n_S \ge 0.1 > 0$, so:

$$d^*(A,B) < d^*(C,B) + 1 \tag{124}$$

□ 

Lemma 23. For the decision tree metric d* above, for any two trees A, B where A ^ B, then for any C such that 
d*(A,C)<A,d*(A,B)<d*{A,C) + d*(C,B). 

Proof. First, observe that C has "one" change from A, which can be that: 
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1 . C has a decision node splitting on variable x where A had a decision node splitting on variable y. 

2. C has a decision node splitting on variable x where A had a leaf l. 

3. A has a node x that was changed to a leaf. 

4. C has a leaf where A had a different leaf. 

In the first case, there is a question of whether or not the decision node x exists in B. If so, then the structural 
disagreement has been reduced by one. However, the leaf disagreement is unchanged or increased by one, so 
$d^*(A,B) \le 1.1 + d^*(C,B) = d^*(A,C) + d^*(C,B)$. If y is not in B, and x is not in B, then 
$d^*(A,B) = d^*(C,B) < 1.1 + d^*(C,B) = d^*(A,C) + d^*(C,B)$. If y is in B, then by Lemma 22, 
$d^*(A,B) < d^*(C,B) + 1 < 1.1 + d^*(C,B) = d^*(A,C) + d^*(C,B)$. 

For the second case, if the new node in C agrees with B, then $d^*(A,B) = 1.1 + d^*(C,B)$. If the leaf in A agreed 
with B, then $d^*(A,B) = d^*(C,B) - 1 < 1.1 + d^*(C,B) = d^*(A,C) + d^*(C,B)$. If the leaf in A disagreed with B 
and the new node in C disagrees with B, then $d^*(A,B) = d^*(C,B) < 1.1 + d^*(C,B) = d^*(A,C) + d^*(C,B)$. 

For the third case, if the new leaf in C agrees with B, then $d^*(A,B) = 1 + d^*(C,B) = d^*(A,C) + d^*(C,B)$. 
If the node in A agreed with B, then by Lemma 22, $d^*(A,B) < d^*(C,B) + 1 = d^*(A,C) + d^*(C,B)$. If the node 
in A disagreed with B, and the new leaf in C disagrees with B, then $d^*(A,B) = d^*(C,B) < 1 + d^*(C,B) = 
d^*(A,C) + d^*(C,B)$. 

Finally, for the fourth case, if the new leaf in C agrees with B, then $d^*(A,B) = 1 + d^*(C,B) = d^*(A,C) + 
d^*(C,B)$. If the leaf in A agreed with B, then by Lemma 22, $d^*(A,B) < d^*(C,B) + 1 = d^*(A,C) + d^*(C,B)$. If 
the leaf in A disagreed with B, and the new leaf in C disagrees with B, then there was no change, and this is an illegal 
transition. □ 

Theorem 24. The distance d* as defined above is the distance function for a graph. 

Proof. In order to prove this, we use Theorem 19. First, $\Delta = 1.1$ and $\delta = 0.55$. 

Observe that by the definition of d*, if two trees are equal, there is no disagreement, and there is zero distance. 
Secondly, by the definition of d*, if there is any difference between two trees A and B, there will be disagreement, 
and $d^*(A,B) \ge 1$. Thus, Condition 1 and Condition 2 are satisfied. 

Now, by Lemma 20, Condition 3 is satisfied. By Lemma 23, Condition 4 is satisfied. 

□ 

In the graph generated from d*, note that a single label disagreement or a single decision node disagreement results 
in an edge. 

Now, we have to derive colors. 

1. ReplaceWithNode(p, v, $l_1$, $l_2$). The path, the variable, and the labels form the color. Note that if the tree 
already has a decision node with label v at path p, this transition is illegal. 

2. ReplaceWithLeaf(p, $l_1$). The path and the leaf form the color. 
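
The two operators can be sketched as pure functions on trees. This is a minimal sketch under stated assumptions rather than the paper's implementation: leaves are Python `bool` labels, a decision node is a tuple `(variable, true_subtree, false_subtree)`, a path is a tuple of `(variable, branch)` edges, and the new decision node is assumed to receive two leaf children labeled l1 and l2. Rejecting a leaf replacement that changes nothing is also an assumption, based on the remark that such a step is an illegal transition. All function names are hypothetical.

```python
def is_leaf(t):
    # Assumed encoding: leaves are bools, decision nodes are 3-tuples.
    return isinstance(t, bool)

def subtree_at(tree, path):
    """Follow path from the root; return None if the path leaves the tree."""
    for var, branch in path:
        if is_leaf(tree) or tree[0] != var:
            return None
        tree = tree[1] if branch else tree[2]
    return tree

def _replace(tree, path, new):
    """Return a copy of tree with the subtree at path replaced by new."""
    if not path:
        return new
    (var, branch), rest = path[0], path[1:]
    if is_leaf(tree) or tree[0] != var:
        raise ValueError("path does not exist in the tree")
    if branch:
        return (var, _replace(tree[1], rest, new), tree[2])
    return (var, tree[1], _replace(tree[2], rest, new))

def replace_with_node(tree, path, v, l1, l2):
    """ReplaceWithNode(p, v, l1, l2): put a node on v with leaves l1, l2 at p."""
    current = subtree_at(tree, path)
    if current is None:
        raise ValueError("path does not exist in the tree")
    if not is_leaf(current) and current[0] == v:
        raise ValueError("illegal: a decision node on v is already at p")
    return _replace(tree, path, (v, l1, l2))

def replace_with_leaf(tree, path, label):
    """ReplaceWithLeaf(p, l): put a leaf with the given label at p."""
    current = subtree_at(tree, path)
    if current is None:
        raise ValueError("path does not exist in the tree")
    if is_leaf(current) and current == label:
        raise ValueError("illegal: the tree would not change")
    return _replace(tree, path, label)
```

For example, applying `replace_with_node` to `("x", True, False)` at path `(("x", True),)` with variable `"y"` and labels `True`, `False` yields `("x", ("y", True, False), False)`.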

Lemma 25. ReplaceWithNode(p, v, $l_1$, $l_2$) is on the shortest path to B if 

1. It can be applied to the current tree. 

2. The variable v is at the path p in B. 

3. A leaf with the label $\neg l_1$ is not at the path $p \circ (v, \text{true})$ in B. 

4. A leaf with the label $\neg l_2$ is not at the path $p \circ (v, \text{false})$ in B. 

If these rules do not apply, it is not on the shortest path. 

Proof. Suppose that A is our current tree. Suppose that $C = R_{(p, v, l_1, l_2)}(A)$. 

First, we establish that if the conditions are satisfied, the edge is on the shortest path. Note that if v is at the path p 
in B, and there is a leaf or another decision node at path p in A, then the node at p is in structural disagreement. 
Therefore, when we replace that node with v, we reduce the structural disagreement. However, we must be careful not to 
increase leaf disagreement. If every child of v in B that is a leaf is labeled correctly in C (conditions 3 and 4), then 
leaf disagreement will not increase. Therefore, by reducing the structural disagreement by 1, we reduce the distance by 
1.1, at a cost of 1.1, meaning the edge is on the shortest path. 

Secondly, we go through the conditions one by one to show that violating any one of them is enough to take the edge off 
the shortest path. Regarding the first condition: if the operation cannot be applied to the current tree, then by 
definition it is not on the shortest path. 

Regarding the second condition: if the variable v is not at path p in B, but A and B are in agreement at the path 
p, then by Lemma 22 changing the variable to v will not decrease the distance sufficiently, so it is not on the shortest 
path. If instead A does not agree with B at path p, then $d^*(A,B) = d^*(C,B)$, and thus C is not on the shortest 
path. 

Regarding the third and fourth conditions: if the variable v is at the path p in B, but some leaf that is a child 
of v in B is set incorrectly in C, then the structural distance is decreased, but the leaf disagreement is increased, so 
$d^*(A,B) = d^*(C,B) + 0.1$. The distance drops by less than the edge cost of 1.1, so the edge is not on the shortest 
path. □ 

Lemma 26. ReplaceWithLeaf(p, $l_1$) is on the shortest path to B if it applies to the current tree, and if the leaf 
$l_1$ is at p in B. If these rules do not apply, it is not on the shortest path. 

Proof. Suppose that A is the initial tree, and $C = R_{p, l_1}(A)$. If the edge applies, and there is the wrong label or a 
decision node at p, then the label is in disagreement in A, but not in C. There are no other changes, so 
$d^*(A,B) = d^*(C,B) + 1 = d^*(A,C) + d^*(C,B)$, and therefore the edge is on a shortest path. 

On the other hand, if there is no leaf at p in B, or the leaf has another label, then this is not the shortest path. 

First of all, if the operator does not apply to A, it cannot be on the shortest path. 

If the label $V_p(B) \ne l_1$ but A and B are in agreement at the path p, then by Lemma 22 
$d^*(A,B) < d^*(C,B) + 1 = d^*(A,C) + d^*(C,B)$. If $V_p(B) \ne l_1$, and A and B are not in agreement at the path p, 
then $d^*(A,B) = d^*(C,B) < d^*(C,B) + 1$. □ 

Thus, we have established our coloring works for decision trees. 
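
The conditions of Lemmas 25 and 26 translate directly into predicates on a color. The sketch below is illustrative and rests on assumptions not fixed by the text above: leaves are Python `bool` labels (so the opposite label is `not l`), decision nodes are tuples `(variable, true_subtree, false_subtree)`, paths are tuples of `(variable, branch)` edges, and a path that does not exist in B is treated as failing the condition. The function names are hypothetical.

```python
def is_leaf(t):
    # Assumed encoding: leaves are bools, decision nodes are 3-tuples.
    return isinstance(t, bool)

def subtree_at(tree, path):
    """Follow path from the root; return None if the path leaves the tree."""
    for var, branch in path:
        if is_leaf(tree) or tree[0] != var:
            return None
        tree = tree[1] if branch else tree[2]
    return tree

def node_color_on_shortest_path(a, b, path, v, l1, l2):
    """Conditions of Lemma 25 for ReplaceWithNode(p, v, l1, l2)."""
    current = subtree_at(a, path)
    # Condition 1: the operator applies to the current tree A.
    if current is None or (not is_leaf(current) and current[0] == v):
        return False
    target = subtree_at(b, path)
    # Condition 2: the variable v is at path p in B.
    if target is None or is_leaf(target) or target[0] != v:
        return False
    # Conditions 3 and 4: B has no leaf with the opposite label below v.
    true_child, false_child = target[1], target[2]
    if is_leaf(true_child) and true_child == (not l1):
        return False
    if is_leaf(false_child) and false_child == (not l2):
        return False
    return True

def leaf_color_on_shortest_path(a, b, path, label):
    """Conditions of Lemma 26 for ReplaceWithLeaf(p, label)."""
    current = subtree_at(a, path)
    # The operator must apply: p exists in A and the step changes the tree.
    if current is None or (is_leaf(current) and current == label):
        return False
    target = subtree_at(b, path)
    # The leaf with this label must be at p in B.
    return target is not None and is_leaf(target) and target == label
```

With `a = ("x", True, True)` and `b = ("x", ("y", True, False), True)`, the color `ReplaceWithNode((("x", True),), "y", True, False)` passes all four conditions, while the same color with `l1 = False` fails condition 3.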